Improvement of Web search result clustering performance based on Word2Vec model feature extension

doi:10.11772/j.issn.1001-9081.2018102106

Journal of Computer Applications ›› 2019, Vol. 39 ›› Issue (6): 1701-1706. DOI: 10.11772/j.issn.1001-9081.2018102106

YANG Nan, LI Yaping

School of Information, Renmin University of China, Beijing 100872, China

Received: 2018-10-19 Revised: 2018-12-13 Online: 2019-06-17 Published: 2019-06-10
Corresponding author: LI Yaping
About the authors: YANG Nan (born 1962), male, from Liaoyang, Liaoning; associate professor, Ph.D., CCF member; main research interests: data mining, Web mining, machine learning. LI Yaping (born 1976), female, from Shijiazhuang, Hebei; lecturer, Ph.D. candidate, CCF member; main research interests: statistical learning, data analysis.
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61773385).

Abstract

For broad or ambiguous user queries, clustering the result list returned by a Web search engine helps users find the content they are interested in. The returned list consists of short texts called snippets, which carry little information. The traditional Term Frequency-Inverse Document Frequency (TF-IDF) feature selection model is poorly suited to such sparse short texts, so clustering performance degrades. An effective remedy is to extend the short texts with an external knowledge base. Inspired by neural-network-based word representation methods, a snippet extension approach based on the Word2Vec word embedding model is proposed: the TopN most similar words given by the Word2Vec model are used to extend each snippet, and the extended documents improve the clustering performance obtained with TF-IDF feature selection. In addition, to reduce the noise introduced by commonly used words, the term frequency weights in the TF-IDF matrix of the extended documents are modified. Experiments were conducted on two public datasets, ODP239 and SearchSnippets, comparing the proposed method with pure snippets without extension, WordNet-based feature extension, and Wikipedia-based feature extension. The experimental results show that the proposed method outperforms the compared methods in clustering performance.

Key words: feature extension, snippet, word embedding technology, search result clustering
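The extension step described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: `TOY_NEIGHBOURS` is a hypothetical hand-written table standing in for the nearest-neighbour lookup of a trained Word2Vec model (for example gensim's `KeyedVectors.most_similar`), the `expansion_weight` down-weighting is only one plausible way to damp noise from added words and is not the term frequency correction used in the paper, and the smoothed IDF is one common variant rather than the authors' exact formula.

```python
import math
from collections import Counter

# Hypothetical stand-in for a trained Word2Vec model: word -> neighbours,
# ordered from most to least similar. A real model would supply these.
TOY_NEIGHBOURS = {
    "jaguar": ["cat", "leopard", "car"],
    "engine": ["motor", "car", "turbine"],
}


def expansion_terms(tokens, neighbours, top_n):
    """Collect the top_n most similar words for every token of a snippet."""
    added = []
    for token in tokens:
        added.extend(neighbours.get(token, [])[:top_n])
    return added


def term_frequencies(tokens, added, expansion_weight=0.5):
    """Count original tokens fully and expansion tokens at a reduced weight.

    The reduced weight is an assumption of this sketch, not the paper's formula.
    """
    tf = Counter()
    for token in tokens:
        tf[token] += 1.0
    for token in added:
        tf[token] += expansion_weight
    return tf


def tfidf(tf_docs):
    """Turn per-document term frequencies into TF-IDF vectors (smoothed IDF)."""
    n_docs = len(tf_docs)
    doc_freq = Counter()
    for tf in tf_docs:
        doc_freq.update(tf.keys())  # each document counts once per term
    return [
        {t: w * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
         for t, w in tf.items()}
        for tf in tf_docs
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def vectorize(snippets, neighbours, top_n):
    """Extend every snippet with top_n neighbours per token, then apply TF-IDF."""
    tf_docs = [
        term_frequencies(s, expansion_terms(s, neighbours, top_n))
        for s in snippets
    ]
    return tfidf(tf_docs)


snippets = [["jaguar", "engine"], ["car", "motor"], ["jaguar", "speed"]]
plain = vectorize(snippets, TOY_NEIGHBOURS, top_n=0)     # no extension
expanded = vectorize(snippets, TOY_NEIGHBOURS, top_n=2)  # Top-2 extension

print(cosine(plain[0], plain[1]))
print(cosine(expanded[0], expanded[1]))
```

In this toy data the first two snippets share no terms, so their similarity without extension is zero. After extension the first snippet gains "car" and "motor" and the two vectors overlap, which is the effect that the extended TF-IDF features are meant to give a clustering algorithm.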

CLC number:

TP391.1

Cite this article: YANG Nan, LI Yaping. Improvement of Web search result clustering performance based on Word2Vec model feature extension[J]. Journal of Computer Applications, 2019, 39(6): 1701-1706.
