Relevance model estimation based on stable semantic clustering

Journal of Computer Applications (计算机应用), 2016, Vol. 36, Issue (5): 1313-1318. DOI: 10.11772/j.issn.1001-9081.2016.05.1313

SUN Xinyu (孙芯宇)¹, WU Jiang (吴江)¹, PU Qiang (蒲强)²

1. School of Economic Information Engineering, Southwestern University of Finance and Economics, Chengdu Sichuan 611130, China;
2. School of Information Science and Engineering, Chengdu University, Chengdu Sichuan 610106, China

Received: 2015-10-21 Revised: 2016-01-07 Online: 2016-05-10 Published: 2016-05-09
Corresponding author: PU Qiang
About the authors: SUN Xinyu (born 1991), female, from Chengde, Hebei, M.S. candidate; research interests: text mining and personalized user recommendation. WU Jiang (born 1980), male, from Quzhou, Zhejiang, associate professor, Ph.D.; research interest: data mining. PU Qiang (born 1971), male, from Neijiang, Sichuan, associate professor, Ph.D.; research interests: information retrieval, statistical language models, and location-based services.
Supported by:
This work is partially supported by the Youth Fund for Humanities and Social Sciences Research of the Ministry of Education of China (11YJCZH084), the Science and Technology Support Program of the Sichuan Provincial Department of Science and Technology (2014GZ0013, 2014SZ0107), and the Key Natural Science Project of the Sichuan Provincial Department of Education (13ZA0297).

Abstract

Abstract: Relevance models estimated from unstable clusters can degrade retrieval performance. To address this problem, a Stable Semantic Relevance Model (SSRM) is proposed. First, the top N documents returned for the initial query form the feedback data set; next, the number of stable semantic clusters in that data set is detected; then, from the stable semantic clusters, the cluster most similar to the user query is selected to estimate SSRM; finally, the retrieval performance of the model is verified by experiments. On five subsets of the TREC data set, SSRM improves Mean Average Precision (MAP) by at least 32.11% and 0.41% over the Relevance Model (RM) and the Semantic Relevance Model (SRM), respectively, and by at least 23.64%, 19.59% and 8.03% over the clustering-based retrieval methods Cluster-Based Document Model (CBDM), LDA-Based Document Model (LBDM) and Resampling, respectively. The experimental results show that SSRM helps improve retrieval performance.

Key words: information retrieval, semantic clustering, stability validation, Independent Component Analysis (ICA), relevance model estimation
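
The last two steps of the procedure outlined in the abstract (choosing the cluster closest to the query, then estimating a relevance model from it) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the stable semantic clusters have already been produced (the ICA-based clustering and the stability validation are omitted), it assumes cosine similarity over pooled term counts as the query-cluster similarity, and it uses an RM1-style estimate with Jelinek-Mercer smoothing as a stand-in for the SSRM estimator. The names `select_cluster`, `estimate_relevance_model` and the parameter `lam` are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_cluster(query, clusters):
    """Index of the cluster whose pooled term counts are closest to the query.

    Assumption: cosine similarity; the abstract does not name the measure.
    """
    q = Counter(query)
    pooled = [Counter(t for doc in cluster for t in doc) for cluster in clusters]
    return max(range(len(clusters)), key=lambda i: cosine(q, pooled[i]))


def estimate_relevance_model(query, docs, lam=0.5):
    """RM1-style estimate: P(w|R) proportional to sum_D P(w|D) * prod_q P(q|D).

    Assumption: P(.|D) is smoothed against the pooled cluster text
    (Jelinek-Mercer, weight `lam`); this stands in for the SSRM estimator.
    """
    background = Counter(t for doc in docs for t in doc)
    total = sum(background.values())
    weights = Counter()
    for doc in docs:
        if not doc:
            continue  # an empty document carries no evidence
        tf, length = Counter(doc), len(doc)

        def p(term):
            return (1 - lam) * tf[term] / length + lam * background[term] / total

        query_likelihood = math.prod(p(q) for q in query)
        for term in background:
            weights[term] += query_likelihood * p(term)
    norm = sum(weights.values())
    return {t: w / norm for t, w in weights.items()} if norm else {}


if __name__ == "__main__":
    # Toy feedback set: two already-formed clusters of tokenized documents.
    clusters = [
        [["stock", "market", "price"], ["market", "trade"]],
        [["football", "match", "goal"], ["match", "team"]],
    ]
    query = ["market", "price"]
    best = select_cluster(query, clusters)
    model = estimate_relevance_model(query, clusters[best])
    for term, prob in sorted(model.items(), key=lambda kv: -kv[1]):
        print(f"{term}\t{prob:.3f}")
```

Restricting the estimate to one cluster keeps feedback documents from unrelated topics out of the model, which is the motivation the abstract gives for cluster selection; the stability check that precedes it is what distinguishes SSRM and is not reproduced here.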

CLC number: TP391.3

SUN Xinyu, WU Jiang, PU Qiang. Relevance model estimation based on stable semantic clustering[J]. Journal of Computer Applications, 2016, 36(5): 1313-1318.

