Indirect spectral clustering towards large text datasets
HOU Hai-xia1,YUAN Min-min2,LIU Chun-xia3
1. Department of Computer Engineering, Taiyuan University, Taiyuan Shanxi 030032,China 2. Department of Information Engineering, Shanxi Conservancy Technical College, Yuncheng Shanxi 044000,China 3. College of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan Shanxi 030024,China
Abstract:To alleviate the computational bottleneck of spectral clustering, in this paper a general ensemble algorithm, called indirect spectral clustering, was developed. The algorithm first grouped a given large dataset into many overclusters and then regarded each obtained overcluster as a basic object. And then the standard spectral clustering ran at this object level. By convention, when applying this new idea to large text datasets, the cosine distance would be the appropriate manner in measuring the similarities between overclusters. The empirical studies on 20-Newgroups dataset show that the proposed algorithm has a 14.72% higher accuracy on average than the K-Means algorithm and has a 0.88% lower accuracy than the normalizedcut spectral clustering. However, the proposed algorithm saves 16.8 times computation time compared to the normalizedcut spectral clustering. In conclusion, with the increase of data size, the computation time of the normalizedcut spectral clustering might become unacceptable; however, the proposed algorithm might efficiently give the nearoptimal solutions.