Journal of Computer Applications ›› 2019, Vol. 39 ›› Issue (6): 1701-1706. DOI: 10.11772/j.issn.1001-9081.2018102106

• Data Science and Technology •

Improvement of Web search result clustering performance based on Word2Vec model feature extension

YANG Nan, LI Yaping

  1. School of Information, Renmin University of China, Beijing 100872, China
  • Received: 2018-10-19 Revised: 2018-12-13 Online: 2019-06-17 Published: 2019-06-10
  • Corresponding author: LI Yaping
  • About the authors: YANG Nan (born 1962), male, from Liaoyang, Liaoning; associate professor, Ph.D., CCF member. His research interests include data mining, Web mining, and machine learning. LI Yaping (born 1976), female, from Shijiazhuang, Hebei; lecturer, Ph.D. candidate, CCF member. Her research interests include statistical learning and data analysis.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61773385).

Abstract: For generalized or ambiguous user queries, clustering the result list returned by a Web search engine helps users find the content they are interested in quickly. The returned list consists of short texts called snippets, which carry little information, and the traditional Term Frequency-Inverse Document Frequency (TF-IDF) feature selection model is not suited to such sparse short texts, so clustering performance degrades. An effective remedy is to extend the snippets using an external knowledge base. Inspired by neural-network-based word representation methods, a snippet extension approach based on the Word2Vec word embedding model is proposed: the TopN most similar words given by the Word2Vec model are used to extend each snippet, and the extended documents improve the clustering performance obtained with TF-IDF feature selection. In addition, to reduce the noise introduced by commonly used terms, the term frequency weights in the TF-IDF matrix of the extended documents are modified. Experiments were conducted on two public datasets, ODP239 and SearchSnippets, comparing the proposed method with pure snippets without extension, Wordnet-based feature extension, and Wikipedia-based feature extension. The experimental results show that the proposed method significantly outperforms the compared methods in terms of clustering performance.

Key words: feature extension, snippet, word embedding technology, search result clustering
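
The snippet extension step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the table `TOY_NEIGHBOURS` is a hypothetical stand-in for the nearest-neighbour lookup of a trained Word2Vec model, its similarity scores are invented, and the function names `expand_snippet` and `tfidf` are chosen here for illustration. The TF-IDF weighting is a plain textbook variant with smoothed IDF; the term frequency weight modification mentioned in the abstract is omitted because the abstract does not give its formula.

```python
import math
from collections import Counter

# Hypothetical stand-in for a trained Word2Vec model's "most similar words"
# lookup. The words and similarity scores are made up for illustration.
TOY_NEIGHBOURS = {
    "jaguar": [("cat", 0.71), ("leopard", 0.69), ("car", 0.55)],
    "engine": [("motor", 0.80), ("car", 0.62)],
    "habitat": [("forest", 0.66), ("wildlife", 0.64)],
}


def expand_snippet(tokens, neighbours, top_n=2):
    """Append the top_n most similar words of each token to the snippet."""
    expanded = list(tokens)
    for tok in tokens:
        ranked = sorted(neighbours.get(tok, []), key=lambda p: p[1], reverse=True)
        expanded.extend(word for word, _ in ranked[:top_n])
    return expanded


def tfidf(docs):
    """Plain TF-IDF with smoothed IDF; returns one {term: weight} dict per doc."""
    n_docs = len(docs)
    # Document frequency: number of documents each term occurs in.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        if not doc:
            vectors.append({})
            continue
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


# Two toy snippets for the ambiguous query "jaguar".
snippets = [["jaguar", "engine"], ["jaguar", "habitat"]]
expanded = [expand_snippet(s, TOY_NEIGHBOURS) for s in snippets]
vectors = tfidf(expanded)

for tokens, vec in zip(expanded, vectors):
    print(tokens)
    print({term: round(weight, 3) for term, weight in vec.items()})
```

In this toy run the two snippets share only the query word before extension, while the added neighbours give each snippet extra terms that separate the two senses. The resulting vectors could then be passed to any standard clustering algorithm; which algorithm the paper uses is not stated in the abstract.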

CLC number: