
Research on an Automatic Summarization Method Based on Sentence Groups

Wang Rongbo (王荣波)1, Zhang Luyao (张璐瑶)2, Li Jie (李杰)3, Huang Xiaoxi (黄孝喜)4, Zhou Changle (周昌乐)5

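    1. Institute of Computer Application Technology, School of Computer Science, Hangzhou Dianzi University
    2. Hangzhou Dianzi University, Xiasha Economic Development Zone, Hangzhou, Zhejiang Province
    3. Hangzhou Research Institute, Huawei Technologies Co., Ltd.
    4. School of Computer Science, Hangzhou Dianzi University
    5. Department of Computer Science, Xiamen University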
  • Received: 2015-09-14 Revised: 2015-11-09 Published: 2015-11-09
  • Corresponding author: Zhang Luyao (张璐瑶)

An Automatic Abstract Method Based on Chinese Sentences Grouping

  • Received:2015-09-14 Revised:2015-11-09 Online:2015-11-09
  • Contact: Lu-Yao ZHANG

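Abstract: Most existing automatic summarization methods take sentences or paragraphs as the processing unit. This paper instead proposes an automatic summarization method based on sentence groups. The method draws on a theory of automatic Chinese sentence-group segmentation based on Multiple Discriminant Analysis (MDA) and uses the resulting sentence groups, which are semantically more coherent across sentences, as the processing granularity for summarization. On this basis, a Latent Dirichlet Allocation (LDA) topic model represents the text as a vector matrix, the K-Means algorithm clusters the vectors, and the summary is generated by extracting from the resulting clusters according to a given proportion. Finally, the Kappa test and the Kendall correlation coefficient are used to evaluate summary quality. The experiments show that the overall Kappa value reaches 0.7 and the Kendall correlation coefficient exceeds 0.8, both above the values for their respective "good" grades. This indicates that summarization with sentence groups as the processing granularity produces better summaries than traditional methods that use sentences as the processing granularity.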

Keywords: automatic summarization, sentence group, topic model, clustering
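The clustering and extraction steps described in the abstract can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the topic vectors are taken as given (in the paper they would come from an LDA model over sentence groups), the function names `kmeans` and `extract_summary`, the deterministic initialization, the `ratio` parameter, and the "closest to the centroid" selection rule are all hypothetical choices that the abstract does not specify.

```python
import math

def dist(a, b):
    """Euclidean distance between two topic vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k, max_iters=100):
    """Plain K-Means. Assumption: the first k vectors seed the centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = None
    for _ in range(max_iters):
        new_labels = [min(range(k), key=lambda c: dist(v, centroids[c]))
                      for v in vectors]
        if new_labels == labels:  # assignments are stable, so stop
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def extract_summary(vectors, labels, centroids, ratio=0.4):
    """Pick a share of each cluster (hypothetical rule: nearest to centroid)."""
    chosen = []
    for c in range(len(centroids)):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if not members:
            continue
        quota = max(1, round(ratio * len(members)))
        members.sort(key=lambda i: dist(vectors[i], centroids[c]))
        chosen.extend(members[:quota])
    return sorted(chosen)  # restore document order

# Toy data: six sentence groups described by two made-up topic weights.
topic_vectors = [
    [0.90, 0.10], [0.10, 0.90], [0.80, 0.20],
    [0.20, 0.80], [0.85, 0.15], [0.15, 0.85],
]
labels, centroids = kmeans(topic_vectors, k=2)
picked = extract_summary(topic_vectors, labels, centroids, ratio=0.4)
print(labels, picked)
```

Returning the selected indices in document order keeps the extracted sentence groups in their original sequence, which is one plausible way to assemble a readable extract.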

Abstract: At present, most automatic abstracting models take sentences or paragraphs as the processing unit. This paper proposes an automatic abstracting method based on sentence grouping. The method adopts an automatic Chinese sentence grouping theory based on Multiple Discriminant Analysis (MDA). The resulting sentence groups carry better semantic information, which makes them more suitable as the processing unit for automatic abstracting. Each text is then represented as a vector matrix using the Latent Dirichlet Allocation (LDA) topic model, and the vectors are clustered with the K-Means algorithm. The candidate abstract is generated from the clustering results according to a given proportion. Finally, the abstracts are evaluated with the Kappa statistic and the Kendall correlation coefficient. The experimental results show that the overall Kappa value reaches 0.7 and the Kendall correlation coefficient exceeds 0.8, both higher than the values for their respective good grades. Automatic abstracting based on sentence grouping can therefore generate better results than traditional methods that use sentences as the processing granularity.

Key words: Automatic Abstracting, Sentences Grouping, Topic Model, Clustering
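The two evaluation measures named in the abstract can be illustrated with a small sketch. The abstract does not say which variants were used, so this assumes Cohen's kappa for two raters and Kendall's tau-a for rankings without ties; the paper may well rely on other forms (for example Kendall's coefficient of concordance). The function names and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two label sequences of equal length.

    Undefined (division by zero) when chance agreement equals 1.
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of the raters' marginal label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Toy data (made up): 1 means the sentence group was selected for the summary.
system_choice = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
human_choice = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
kappa = cohen_kappa(system_choice, human_choice)

# Toy data (made up): two rankings of the same five summaries.
rank_system = [1, 2, 3, 4, 5]
rank_human = [1, 3, 2, 4, 5]
tau = kendall_tau(rank_system, rank_human)
print(round(kappa, 3), round(tau, 3))
```

On this toy data the two label sequences disagree on one item out of ten and the two rankings swap one adjacent pair, so both scores come out high but below 1.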

CLC number: