Journal of Computer Applications ›› 2010, Vol. 30 ›› Issue (06): 1661-1663.

• Software Process Technology and Chinese Information Processing •

Large-scale document forward detection algorithm based on agglomerate-term

ZHANG Jing-Yang1, ZHANG Hua-Ping2, LIU Jin-Gang3

  1. Beijing Zhongke Tianji Information Technology Co., Ltd. (北京中科天玑信息技术有限公司)
  2.
  3. Joint Research Institute of Computer Science, Capital Normal University (首都师范大学 计算机科学联合研究院)
  • Received: 2009-12-15 Revised: 2010-02-10 Online: 2010-06-01 Published: 2010-06-01
  • Contact: ZHANG Jing-Yang
  • Supported by:
    National High-Tech Research Program of China (863 Program)
Abstract: Document forward detection identifies sets of documents with identical or near-identical content in a large-scale text collection. It is widely needed in applications such as hot-topic detection, consolidation of search engine results, and plagiarism detection for academic articles. To cope with the increasingly diverse forms of forwarded text on the Internet and to further improve the efficiency of practical systems, this paper analyzes a range of text features and comparison algorithms and proposes a large-scale document forward detection algorithm based on agglomerate-terms. The algorithm first identifies and extracts high-scoring agglomerate-terms according to the distributional properties of terms and uses them to represent each text. It then makes two passes over the text collection, an extensive linear comparison followed by a multi-dimensional comparison, and selects the final forward detection results. Comparative experiments show that the algorithm achieves high overall performance in precision, recall, and efficiency.

Key words: forward detection, Agglomerate-Term (AgT), feature selection, extensive linear comparison, Vector Space Model (VSM)
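The pipeline summarized in the abstract (represent each text by a few high-scoring terms, then run a cheap linear pass followed by a vector-space pass) can be illustrated with a minimal Python sketch. The abstract does not define the agglomerate-term score or either comparison step, so everything below is an assumption made for illustration only: the clustering score in `top_terms`, the sort-and-window reading of the linear pass, the cosine similarity used for the second pass, and all function names and parameters (`k`, `window`, `threshold`) are hypothetical and are not the authors' method.

```python
import math
from collections import Counter, defaultdict


def top_terms(tokens, k=3):
    """Pick up to k terms by a stand-in clustering score (assumed, not the paper's).

    score = occurrences / (1 + mean gap between consecutive occurrences),
    so terms that repeat close together rank higher.
    """
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        positions[tok].append(i)

    scores = {}
    for term, pos in positions.items():
        if len(pos) < 2:
            continue  # a term seen once carries no distribution information
        mean_gap = (pos[-1] - pos[0]) / (len(pos) - 1)
        scores[term] = len(pos) / (1.0 + mean_gap)

    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return tuple(sorted(ranked[:k]))  # order-independent signature


def cosine(tokens_a, tokens_b):
    """Cosine similarity of raw term-frequency vectors."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def detect_forwards(docs, k=3, window=2, threshold=0.8):
    """docs maps a document id to its token list; returns a set of id pairs."""
    signatures = {doc_id: top_terms(toks, k) for doc_id, toks in docs.items()}

    # Pass 1 (assumed linear pass): sort by signature and compare each
    # document only with its next `window` neighbours in that order.
    order = sorted(docs, key=lambda d: (signatures[d], d))
    candidates = set()
    for i, a in enumerate(order):
        for b in order[i + 1:i + 1 + window]:
            if set(signatures[a]) & set(signatures[b]):
                candidates.add((a, b))

    # Pass 2 (assumed vector-space pass): confirm candidates with cosine.
    return {
        tuple(sorted(pair))
        for pair in candidates
        if cosine(docs[pair[0]], docs[pair[1]]) >= threshold
    }


if __name__ == "__main__":
    sample = {
        "a": "storm hits coast storm warning storm coast flooded".split(),
        "b": "storm hits coast storm warning storm coast flooded today".split(),
        "c": "team wins final team celebrates team fans cheer".split(),
    }
    print(detect_forwards(sample))
```

The two-pass layout is the point of the sketch: the first pass costs roughly one sort plus a fixed number of neighbour checks per document, so the more expensive vector comparison runs only on the few pairs that survive it.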