Deep Web resource selection using topic model

doi:10.11772/j.issn.1001-9081.2015.09.2553

Abstract

Abstract: Federated search is a widely used technique for finding information on the Deep Web. Given a user query, one of the challenges for a federated search system is to select the set of resources most likely to return relevant results. Most existing resource selection methods rely on text matching between the query and the sample documents of each resource, and therefore typically suffer from missing vocabulary or incomplete information. To alleviate the problem of incomplete information, a resource selection approach based on the Latent Dirichlet Allocation (LDA) topic model was proposed. First, topic probability distributions for the resources and the query were inferred using the LDA topic model. Then, the similarities between the topic distributions of the resources and that of the query were calculated to rank the resources. Mapping both the resources and the query into a low-dimensional topic space alleviated the missing-information problem caused by the sparsity of the high-dimensional word space. Experiments were conducted on the test sets of the TREC FedWeb 2013 and 2014 Tracks, and the results were compared with those of other participants in the Tracks. On the TREC FedWeb 2013 Track, the LDA-based approach outperforms the best result of the other participants by 24%. On the TREC FedWeb 2014 Track, it outperforms the best results of the traditional text-matching-based resource selection methods by 22% for small-document methods and by 43% for big-document methods. In addition, using sampled snippets rather than documents to generate the big-document representation of each resource significantly improves the efficiency of the system, making the proposed approach more feasible and applicable in practice.

Key words: deep Web, topic model, Latent Dirichlet Allocation (LDA), data resource selection, federated search
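
The ranking step described in the abstract (comparing the topic distribution of the query with that of each resource) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say which similarity measure was used, so Jensen-Shannon divergence is an assumption here, and the resource names and topic distributions are hypothetical values standing in for the output of an LDA inference step.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats.

    Terms where p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln 2.

    Assumed similarity measure; the abstract does not name one.
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def rank_resources(query_topics, resource_topics):
    """Return resource names ordered from most to least similar to the query.

    A smaller divergence means the topic distributions are closer.
    """
    scored = [
        (js_divergence(query_topics, dist), name)
        for name, dist in resource_topics.items()
    ]
    return [name for _, name in sorted(scored)]


# Hypothetical topic distributions over three topics, standing in for
# what an LDA model would infer for each resource and for the query.
resource_topics = {
    "news_engine": [0.70, 0.20, 0.10],
    "video_engine": [0.10, 0.10, 0.80],
    "academic_engine": [0.20, 0.70, 0.10],
}
query_topics = [0.15, 0.75, 0.10]

ranking = rank_resources(query_topics, resource_topics)
print(ranking)
```

Because both the query and the resources are compared in the same low-dimensional topic space, a resource can rank highly even when its sampled text shares few literal terms with the query.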

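Abstract (translated from the Chinese): Federated search is an important technique for obtaining information from the large-scale Deep Web. Given a user query, a main problem that a federated search system must solve is data source selection, that is, selecting from a massive number of data sources the set most likely to return relevant results. Most existing data source selection algorithms are based on keyword matching between the query and the sample document set of each data source, and usually cannot adequately handle the information missing from a small number of sample documents. To address this problem, a data source selection method based on the Latent Dirichlet Allocation (LDA) topic model is proposed. First, the LDA topic model is used to obtain the topic probability distributions of the data sources and the query; then, all data sources are ranked by comparing the closeness of the two topic probability distributions. Mapping the data sources and the query into a low-dimensional topic space addresses the information loss caused by the sparsity of the high-dimensional term space. Experiments were conducted on the test sets of the TREC FedWeb 2013 and 2014 Tracks, and the results were compared with those of the other participating methods. On the FedWeb 2013 test set, the method improves on the best result of the other participating methods by 24%; on the FedWeb 2014 test set, it improves on the traditional keyword-matching methods based on small documents and big documents by 22% and 43%, respectively. In addition, using document snippets in place of documents greatly improves the efficiency of the system, further increasing the practicality and feasibility of the method.

Key words (translated from the Chinese): deep Web, topic model, Latent Dirichlet Allocation, data source selection, federated search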

CLC Number: TP391.3

WANG Qiuyue, CAO Wei, SHI Shaochen. Deep Web resource selection using topic model[J]. Journal of Computer Applications, 2015, 35(9): 2553-2559.

王秋月, 曹巍, 史少晨. 基于主题模型的深层网数据源选择算法[J]. 计算机应用, 2015, 35(9): 2553-2559.
