Journal of Computer Applications ›› 2012, Vol. 32 ›› Issue (11): 3018-3022.

• Artificial intelligence •

Web clustering based on hybrid probabilistic latent semantic analysis model

WANG Zhi-he1, WANG Ling-yun2, DANG Hui1, PAN Li-na1

  1. College of Computer Science and Engineering, Northwest Normal University, Lanzhou, Gansu 730070, China
  2. Department of Science and Technology, Lanzhou Bank, Lanzhou, Gansu 730030, China
  • Received: 2012-05-11 Revised: 2012-06-28 Online: 2012-11-12 Published: 2012-11-01
  • Corresponding author: WANG Zhi-he
  • About the authors: WANG Zhi-he (born 1965), male, from Wuwei, Gansu, professor; main research interest: data mining. WANG Ling-yun (born 1986), male, from Dingxi, Gansu, master's student; main research interest: data mining. DANG Hui (born 1988), female, from Yongjing, Gansu, master's degree; main research interest: data mining. PAN Li-na (born 1984), female, from Pingliang, Gansu, master's degree; main research interest: data mining.

Abstract: To better understand users' inherent characteristics in e-commerce applications and to formulate effective marketing strategies, a Web clustering algorithm based on a Hybrid Probabilistic Latent Semantic Analysis (HPLSA) model is proposed. Separate Probabilistic Latent Semantic Analysis (PLSA) models are built on user browsing data, page content information, and content-enhanced user transaction data. The three PLSA models are then merged through the log-likelihood function to obtain an HPLSA model for user clustering and an HPLSA model for page clustering. In the clustering analysis, the conditional probabilities between latent topics and users, pages, and sites serve as the basis for similarity calculation, and the distance-based k-medoids algorithm is used for clustering. The HPLSA model was designed and constructed, and the Web clustering algorithm was verified on it; the results indicate that the algorithm is feasible.

Key words: Web clustering, Probabilistic Latent Semantic Analysis (PLSA), latent topic, k-medoids algorithm
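
The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it fits a single PLSA model by EM on a toy user-by-page count matrix, then clusters users with k-medoids on their topic distributions P(z|u). The merging of three PLSA models into HPLSA is not reproduced, because the abstract does not give the merging formula. The function names (`plsa`, `log_likelihood`, `k_medoids`), the toy counts, the Euclidean distance between P(z|u) vectors, and the farthest-first initialisation are all assumptions made for illustration.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit PLSA by EM on a (users x pages) count matrix.

    Returns P(z|u), shape (users, topics), and P(p|z), shape (topics, pages).
    """
    rng = np.random.default_rng(seed)
    n_users, n_pages = counts.shape
    p_z_u = rng.random((n_users, n_topics))
    p_z_u /= p_z_u.sum(axis=1, keepdims=True)
    p_p_z = rng.random((n_topics, n_pages))
    p_p_z /= p_p_z.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: posterior P(z|u,p) is proportional to P(z|u) * P(p|z).
        joint = p_z_u[:, :, None] * p_p_z[None, :, :]  # (users, topics, pages)
        post = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate both distributions from expected counts.
        expected = counts[:, None, :] * post
        p_p_z = expected.sum(axis=0)
        p_p_z /= p_p_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_u = expected.sum(axis=2)
        p_z_u /= p_z_u.sum(axis=1, keepdims=True) + 1e-12
    return p_z_u, p_p_z

def log_likelihood(counts, p_z_u, p_p_z):
    """Sum over (u, p) of n(u,p) * log P(p|u), with P(p|u) = sum_z P(z|u) P(p|z)."""
    p_p_u = p_z_u @ p_p_z
    return float((counts * np.log(p_p_u + 1e-12)).sum())

def k_medoids(points, k, n_iter=100):
    """Alternating k-medoids on Euclidean distances (distance choice is an assumption)."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    # Deterministic farthest-first initialisation (an assumption of this sketch).
    medoids = [int(dist.sum(axis=1).argmin())]
    while len(medoids) < k:
        medoids.append(int(dist[:, medoids].min(axis=1).argmax()))
    medoids = np.array(medoids)
    for _ in range(n_iter):
        labels = dist[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:
                # The new medoid is the member with the smallest total distance to its cluster.
                within = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels

# Hypothetical toy data: 4 users x 4 pages of visit counts.
counts = np.array([[5, 4, 0, 0],
                   [4, 5, 0, 0],
                   [0, 0, 5, 4],
                   [0, 0, 4, 5]], dtype=float)

p_z_u, p_p_z = plsa(counts, n_topics=2)
medoids, labels = k_medoids(p_z_u, k=2)
print("P(z|u):", p_z_u.round(3))
print("log-likelihood:", log_likelihood(counts, p_z_u, p_p_z))
print("medoids:", medoids, "labels:", labels)
```

Clustering pages instead of users would follow the same pattern on the transposed count matrix. A hybrid model in the spirit of the abstract would combine the log-likelihoods of several such models, but the weighting scheme would have to come from the full paper.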

CLC number: