Large-scale document forward detection algorithm based on agglomerate-term

Journal of Computer Applications (计算机应用), 2010, Vol. 30, Issue 06: 1661-1663.

• Software Process Technology and Chinese Information Processing •

ZHANG Jing-Yang (张京阳)¹, ZHANG Hua-Ping (张华平)², LIU Jin-Gang (刘金刚)³

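1. Beijing Zhongke Tianji Information Technology Co., Ltd. (北京中科天玑信息技术有限公司)
2.
3. Joint Research Institute of Computer Science, Capital Normal University (首都师范大学计算机科学联合研究院)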

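Received: 2009-12-15  Revised: 2010-02-10  Online: 2010-06-01  Published: 2010-06-01
Corresponding author: ZHANG Jing-Yang
Funding:
National High Technology Research and Development Program of China (863 Program)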


Abstract

Abstract: Document forward detection (identifying reposted or reprinted text) finds sets of documents with identical or near-identical content in a large-scale text collection. It is widely needed in applications such as hot-topic detection, condensing search engine results, and detecting plagiarism in academic articles. To cope with the increasingly diverse forms of reposting on the Internet and to further improve the efficiency of practical systems, the paper studies and analyzes various text features and comparison algorithms, and proposes a large-scale document forward detection algorithm based on agglomerate-terms. The algorithm identifies and extracts high-scoring agglomerate-terms according to the distribution properties of terms and uses them to represent each text; it then runs two passes over the text collection, an extensive linear comparison followed by a multi-dimensional comparison, to filter out the final detection results. Comparative experiments show that the algorithm achieves high overall performance in precision, recall, and efficiency.

Key words: forward detection, agglomerate-term (AgT), feature selection, extensive linear comparison, Vector Space Model (VSM)
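
The pipeline outlined in the abstract (score terms by their distribution, keep the top-scoring ones as features, then compare documents in a cheap pass followed by a vector-space pass) can be sketched roughly as follows. This is an illustration only, not the authors' implementation: the abstract does not give the scoring formula or the details of either comparison, so the clustering score in `term_scores`, the shared-term bucketing used for the first pass, and the cosine threshold in the second pass are all assumptions, as are every function and parameter name. Input is assumed to be already tokenized; Chinese text would need word segmentation first.

```python
import math
from collections import defaultdict
from itertools import combinations


def term_scores(tokens):
    """Score terms by how tightly their occurrences cluster (assumed heuristic)."""
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        positions[tok].append(i)
    scores = {}
    for term, pos in positions.items():
        if len(pos) < 2:
            continue  # a single occurrence cannot form a cluster
        span = pos[-1] - pos[0] + 1
        # Many occurrences packed into a short span give a high score.
        scores[term] = len(pos) ** 2 / span
    return scores


def top_terms(tokens, k=3):
    """Keep the k highest-scoring terms as the document's feature vector."""
    ranked = sorted(term_scores(tokens).items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:k])


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def detect_forwards(docs, k=3, threshold=0.8):
    """docs maps a document id to its token list; returns sorted id pairs."""
    features = {d: top_terms(toks, k) for d, toks in docs.items()}
    # Pass 1 (assumed stand-in for the cheap first comparison): bucket
    # documents by feature term so only documents sharing a term are paired.
    buckets = defaultdict(set)
    for d, feats in features.items():
        for term in feats:
            buckets[term].add(d)
    candidates = set()
    for ids in buckets.values():
        candidates.update(combinations(sorted(ids), 2))
    # Pass 2 (vector space model): keep pairs whose feature vectors are close.
    return sorted(p for p in candidates
                  if cosine(features[p[0]], features[p[1]]) >= threshold)


docs = {
    "a": "storm hits coast storm warning issued storm coast residents leave coast".split(),
    "b": "breaking storm hits coast storm warning issued storm coast residents leave coast now".split(),
    "c": "market rally lifts stocks market gains as stocks climb market".split(),
}
print(detect_forwards(docs))  # [('a', 'b')]
```

In this toy input, document "b" is document "a" with one word added at each end, so both yield the same feature terms and scores, while "c" shares no feature term with either and is never compared in the second pass.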

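ZHANG Jing-Yang, ZHANG Hua-Ping, LIU Jin-Gang. Large-scale document forward detection algorithm based on agglomerate-term[J]. Journal of Computer Applications, 2010, 30(06): 1661-1663.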

