Journal of Computer Applications (计算机应用) ›› 2020, Vol. 40 ›› Issue (4): 1069-1073. DOI: 10.11772/j.issn.1001-9081.2019091540

• Data Science and Technology •

Discrimination information method based on consensus and classification for improving document clustering

WANG Liuyang, YU Yangxin, CHEN Bolun, ZHANG Hui

  1. Faculty of Computer & Software Engineering, Huaiyin Institute of Technology, Huai'an, Jiangsu 223003, China
  • Received: 2019-09-05 Revised: 2019-10-23 Published: 2020-04-10 Online: 2019-11-05
  • Corresponding author: WANG Liuyang
  • About the authors: WANG Liuyang (born 1974), male, from Huai'an, Jiangsu; associate professor, M.S.; research interests: information management and information systems, intelligent information processing, big data mining. YU Yangxin (born 1970), male, from Taizhou, Jiangsu; professor, M.S.; research interests: information management and information systems, intelligent information processing, knowledge organization. CHEN Bolun (born 1986), male, from Huai'an, Jiangsu; associate professor, Ph.D.; research interest: link prediction in complex networks. ZHANG Hui (born 1970), female, from Nantong, Jiangsu; professor, M.S.; research interests: information management and information systems, intelligent information processing.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61602202).

Abstract: Different clustering algorithms adopt different strategies, yet each technique has certain limitations when applied to a particular dataset. An appropriate choice of Discrimination Information Method (DIM) helps ensure that document clustering succeeds. To address these problems, a DIM for Document Clustering based on Consensus and Classification (DCCC) was proposed. Firstly, Clustering by Discrimination Information Maximization (CDIM) was chosen to generate the initial clusterings of a dataset, and two different CDIM methods were used to produce two initial cluster sets. Then, the two initial cluster sets were re-initialized with different parameter methods, and a consensus was established from the relationship between their cluster labels, so as to maximize the sum of the documents' discrimination numbers. Finally, Discrimination Text Weight Classification (DTWC) was chosen as the text classifier to assign new cluster labels to the consensus; the base partitions were altered by training the text classifier, and the final partition was generated from the predicted labels. Experiments were carried out on eight network datasets, with BCubed precision and recall used as the clustering validation metrics. The experimental results show that the clustering results of the proposed consensus and classification method are superior to those of the comparison methods.
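
The three stages described in the abstract (two base partitions, a consensus over their cluster labels, and a classifier that relabels the remaining documents) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify CDIM or DTWC, so the base partitions are simply passed in as label lists, and a plain term-weight centroid scorer stands in for DTWC. All function names, and the choice to treat "consensus" as the documents on which the two aligned partitions agree, are assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations


def align_labels(labels_a, labels_b, k):
    """Rename the cluster ids of partition B so that it agrees with
    partition A on as many documents as possible.
    Brute force over all k! renamings, so only suitable for small k."""
    best_perm, best_overlap = None, -1
    for perm in permutations(range(k)):
        overlap = sum(1 for a, b in zip(labels_a, labels_b) if a == perm[b])
        if overlap > best_overlap:
            best_perm, best_overlap = perm, overlap
    return [best_perm[b] for b in labels_b]


def train_centroids(docs, labels, indices, k):
    """Sum the term weights of the consensus documents per cluster.
    This is a stand-in classifier, not DTWC."""
    centroids = [Counter() for _ in range(k)]
    for i in indices:
        centroids[labels[i]].update(docs[i])  # Counter.update adds weights
    return centroids


def predict(doc, centroids):
    """Return the cluster whose centroid has the largest dot product with doc."""
    def score(centroid):
        return sum(w * centroid.get(term, 0) for term, w in doc.items())
    return max(range(len(centroids)), key=lambda c: score(centroids[c]))


def consensus_then_classify(docs, labels_a, labels_b, k):
    """docs: list of {term: weight} dicts; labels_a, labels_b: base partitions
    with cluster ids in range(k). Returns the final label list."""
    aligned_b = align_labels(labels_a, labels_b, k)
    # Consensus (assumption): documents on which both partitions agree.
    agreed = [i for i, (a, b) in enumerate(zip(labels_a, aligned_b)) if a == b]
    centroids = train_centroids(docs, labels_a, agreed, k)
    agreed_set = set(agreed)
    # Agreed documents keep their label; the rest get the predicted label.
    return [labels_a[i] if i in agreed_set else predict(doc, centroids)
            for i, doc in enumerate(docs)]


# Toy data: the two partitions disagree only on document 2.
docs = [
    {"apple": 2, "fruit": 1},
    {"apple": 1, "fruit": 2},
    {"apple": 1, "fruit": 1, "engine": 1},
    {"engine": 2, "car": 1},
    {"car": 2, "engine": 1},
]
labels_a = [0, 0, 0, 1, 1]
labels_b = [1, 1, 0, 0, 0]  # same grouping idea, different cluster ids
final_labels = consensus_then_classify(docs, labels_a, labels_b, k=2)
print(final_labels)
```

In this toy run the disputed document shares more term weight with the "fruit" consensus cluster, so it is assigned there. The unnormalized dot product favors larger clusters; a real classifier would weight terms by their discrimination information rather than raw counts.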

Key words: consensus clustering, document clustering, discrimination information, cluster label, text classifier
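
The abstract evaluates clusterings with BCubed precision and recall. As a minimal sketch using the standard BCubed definition for non-overlapping clusters (the function name `bcubed` and the toy labels are illustrative, not taken from the paper): for each document, precision is the fraction of documents in its cluster that share its gold class, recall is the fraction of documents in its gold class that share its cluster, and both are averaged over all documents.

```python
from collections import Counter


def bcubed(clusters, classes):
    """Return (precision, recall) under the BCubed metric.

    clusters: predicted cluster id per document
    classes:  gold class label per document (same order, same length)
    """
    if len(clusters) != len(classes) or not clusters:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(clusters)
    cluster_size = Counter(clusters)
    class_size = Counter(classes)
    # Number of documents sharing both a cluster and a class.
    pair_size = Counter(zip(clusters, classes))
    precision = sum(pair_size[(c, g)] / cluster_size[c]
                    for c, g in zip(clusters, classes)) / n
    recall = sum(pair_size[(c, g)] / class_size[g]
                 for c, g in zip(clusters, classes)) / n
    return precision, recall


# Toy example: one document of class "b" is placed in the "a" cluster.
p, r = bcubed([0, 0, 0, 1, 1], ["a", "a", "b", "b", "b"])
print(round(p, 4), round(r, 4))
```

For this toy input both values equal 11/15: the misplaced document lowers the precision of its cluster-mates and the recall of its class-mates by the same amount.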

CLC number: