Journal of Computer Applications (计算机应用)

• Database and Artificial Intelligence •

Chinese character association measurement method and its application on Chinese text similarity analysis
(汉字关联性量化方法及其在文本相似性分析中的应用)

Zhao Yanbin (赵彦斌)

  1. School of Computer Science, Huazhong University of Science and Technology
  • Received: 2005-12-19 Revised: 2006-02-21 Online: 2006-06-01 Published: 2006-06-01
  • Corresponding author: Zhao Yanbin

Abstract: Text similarity analysis, clustering, and classification are mostly based on feature words. Because Chinese text has no delimiters between words, the groundwork these methods require, namely Chinese word segmentation and the handling of high-dimensional feature spaces, incurs a high computational cost. This paper explores an approach to text similarity analysis that uses the relationships between Chinese characters instead of feature words. It first defines a method for quantifying the relationship between characters in a text and introduces the concept of Chinese Character Association Measurement (CCAM). It then constructs a CCAM matrix to represent a Chinese text and designs a similarity measurement algorithm for Chinese texts based on that matrix. Experimental results show that CCAM outperforms bigram (two-character word) frequency, mutual information, and the t-test as a statistic. Because it requires no word segmentation, the algorithm is suitable for processing massive amounts of Chinese text.

Key words: Chinese Character Association Measurement (CCAM), information matrix, text similarity measurement algorithm
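
As a rough illustration of the representation described in the abstract, the sketch below builds a sparse character-by-character matrix from adjacent character pairs and compares two texts without any word segmentation. The abstract does not give the CCAM formula, so the matrix entries here are plain adjacent-pair counts (the bigram-frequency baseline), and the cosine comparison is an assumption, not the paper's algorithm. `baseline_stats` computes the three baseline statistics named in the abstract in their textbook forms. All function names are hypothetical.

```python
import math
from collections import Counter


def is_han(c):
    """True for characters in the CJK Unified Ideographs block (U+4E00..U+9FFF)."""
    return "\u4e00" <= c <= "\u9fff"


def pair_counts(text):
    """Sparse character-pair matrix: counts of adjacent Han character pairs.

    ASSUMPTION: raw bigram counts stand in for the CCAM value,
    whose definition the abstract does not give.
    Pairs that span punctuation or non-Han characters are not counted.
    """
    return Counter((a, b) for a, b in zip(text, text[1:]) if is_han(a) and is_han(b))


def cosine(m1, m2):
    """Cosine similarity between two sparse pair-count matrices (an assumed measure)."""
    dot = sum(v * m2[k] for k, v in m1.items() if k in m2)
    n1 = math.sqrt(sum(v * v for v in m1.values()))
    n2 = math.sqrt(sum(v * v for v in m2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0


def baseline_stats(text, a, b):
    """Bigram frequency, pointwise mutual information, and t-score for pair (a, b).

    Returns None if the pair never occurs adjacently in the text.
    """
    chars = [c for c in text if is_han(c)]
    pairs = pair_counts(text)
    n_pairs = sum(pairs.values())
    freq = pairs[(a, b)]
    if freq == 0:
        return None
    p_ab = freq / n_pairs
    p_a = chars.count(a) / len(chars)
    p_b = chars.count(b) / len(chars)
    return {
        "freq": freq,
        # PMI: log2 of observed pair probability over the independence estimate
        "pmi": math.log2(p_ab / (p_a * p_b)),
        # t-score: (observed - expected) / sqrt(variance / N), variance ~ p_ab
        "t": (p_ab - p_a * p_b) / math.sqrt(p_ab / n_pairs),
    }
```

For example, `cosine(pair_counts("文本相似性分析"), pair_counts("文本相似度分析"))` gives 4/6 ≈ 0.667, because four of each text's six adjacent pairs are shared. Replacing the counts in `pair_counts` with a different association value changes the representation but leaves the comparison step as is.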