Knowledge retrieval based on text clustering and distributed Lucene

doi:10.3724/SP.J.1087.2013.00186

Journal of Computer Applications, 2013, Vol. 33, Issue (01): 186-188.

FENG Ruwei, XIE Qiang, DING Qiulin

College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing Jiangsu 210016, China

Received: 2012-07-23 Revised: 2012-08-22 Online: 2013-01-09 Published: 2013-01-01
Contact: FENG Ruwei
About the authors: FENG Ruwei (born 1988), male, from Jiangyin, Jiangsu, M.S. candidate; main research interest: distributed computing. XIE Qiang (born 1972), male, from Zigong, Sichuan, associate professor, Ph.D.; main research interests: knowledge engineering, information systems, information security. DING Qiulin (born 1935), male, from Fuzhou, Jiangxi, professor, doctoral supervisor; main research interests: aerospace manufacturing engineering, management and informatization.

Abstract

To address the poor performance and efficiency of traditional centralized indexes when processing large-scale unstructured knowledge, the authors propose a retrieval algorithm based on text clustering. The algorithm uses text clustering to improve the existing index partitioning scheme, and it infers the query intent from the distance between the query and the clusters in order to narrow the search range. Experimental results show that the proposed scheme effectively relieves the pressure of indexing and retrieving large-scale data and greatly improves distributed retrieval performance, while maintaining relatively high precision and recall.

Key words: unstructured knowledge, distributed index, text clustering, full-text search, parallel retrieval
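
The routing idea summarized in the abstract (partition the index by cluster, then send a query only to the partitions whose clusters lie closest to it) can be illustrated with a minimal sketch. This is not the authors' implementation and does not use Lucene: the shard contents, the function names (`centroid`, `route_query`, `search`), the whitespace tokenizer, the use of raw term frequencies, and the choice of cosine distance are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a stand-in for a real analyzer)."""
    return text.lower().split()


def centroid(docs):
    """Average term-frequency vector of one cluster of documents."""
    total = Counter()
    for doc in docs:
        total.update(tokenize(doc))
    return {term: count / len(docs) for term, count in total.items()}


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def route_query(query, centroids, top_n=1):
    """Return the ids of the top_n clusters whose centroids are closest to the query."""
    q_vec = Counter(tokenize(query))
    ranked = sorted(centroids, key=lambda cid: cosine_distance(q_vec, centroids[cid]))
    return ranked[:top_n]


def search(query, shards, centroids, top_n=1):
    """Scan only the shards selected by route_query; match on any shared term."""
    terms = set(tokenize(query))
    hits = []
    for cid in route_query(query, centroids, top_n):
        for doc in shards[cid]:
            if terms & set(tokenize(doc)):
                hits.append(doc)
    return hits


# Hypothetical pre-clustered shards; a clustering step would normally produce these.
shards = {
    "aviation": ["wing design for aircraft", "aircraft engine maintenance"],
    "databases": ["index structures for databases", "query planning in databases"],
}
centroids = {cid: centroid(docs) for cid, docs in shards.items()}
```

Calling `search("aircraft engine", shards, centroids)` consults only the `aviation` shard, so the `databases` shard is never scanned. Raising `top_n` widens the search range, which in a scheme of this kind would be expected to trade retrieval speed for recall.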

CLC Number: TP391.3

FENG Ruwei, XIE Qiang, DING Qiulin. Knowledge retrieval based on text clustering and distributed Lucene[J]. Journal of Computer Applications, 2013, 33(01): 186-188.


