Journal of Computer Applications, 2012, Vol. 32, Issue (08): 2250-2257. DOI: 10.3724/SP.J.1087.2012.02250

• Artificial Intelligence •

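  • About the authors: DONG Yuan-yuan (born 1988), male, from Yancheng, Jiangsu, is a master's student whose main research interests are neural networks and text classification;
    CHEN Ji-li (born 1972), female, from Guilin, Guangxi, is an associate professor with a master's degree whose main research interests are data mining and intelligent computing;
    TANG Xiao-xia (born 1984), female, from Baoji, Shaanxi, is a master's student whose main research interest is complex networks.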

Unsupervised feature selection method based on latent Dirichlet allocation model and mutual information

DONG Yuan-yuan¹, CHEN Ji-li¹, TANG Xiao-xia²

  1. College of Information Science and Engineering, Guilin University of Technology, Guilin, Guangxi 541004, China
  2. College of Science, Guilin University of Technology, Guilin, Guangxi 541004, China
  • Received: 2012-01-09; Revised: 2012-03-04; Online: 2012-08-28; Published: 2012-08-01
  • Corresponding author: DONG Yuan-yuan

Abstract: To address the category-deficiency problem and the bias toward low-frequency words in feature selection based on Mutual Information (MI), a method named LDA-σ is proposed. Latent topics are first extracted with the Latent Dirichlet Allocation (LDA) model, and the standard deviation of the word-topic MI is then used as the feature evaluation function. In feature selection and classification experiments on the Reuters-21578 corpus, the micro-averaged F1 of LDA-σ reached 0.9096, and its macro-averaged F1, which was higher than that of the other algorithms compared, peaked at 0.7823. The experimental results indicate that LDA-σ can be applied to text feature selection.

Key words: Latent Dirichlet Allocation (LDA) model, Mutual Information (MI), evaluation function
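
The scoring idea summarized in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact MI formula, so this sketch assumes the pointwise form MI(w, t) = log(p(w | t) / p(w)), with p(w) obtained by marginalizing over topics. The function names `lda_sigma_scores` and `select_features`, the smoothing constant `eps`, and the toy matrices are hypothetical, chosen only for illustration; in practice the topic-word distribution `phi` and the topic prior would be estimated by a fitted LDA model.

```python
import numpy as np


def lda_sigma_scores(phi, topic_prior):
    """Score each word by the standard deviation of its word-topic MI.

    phi:         (n_topics, n_words) array, phi[t, w] = p(w | t); rows sum to 1.
    topic_prior: (n_topics,) array, p(t); sums to 1.
    """
    phi = np.asarray(phi, dtype=float)
    topic_prior = np.asarray(topic_prior, dtype=float)
    eps = 1e-12  # illustrative smoothing to avoid log(0); not from the paper

    # Marginal word probability: p(w) = sum_t p(t) * p(w | t)
    p_word = topic_prior @ phi

    # Assumed pointwise MI: MI(w, t) = log(p(w | t) / p(w))
    mi = np.log((phi + eps) / (p_word + eps))

    # Feature score: standard deviation of MI(w, t) across topics
    return mi.std(axis=0)


def select_features(phi, topic_prior, k):
    """Return the indices of the k words with the largest scores."""
    scores = lda_sigma_scores(phi, topic_prior)
    return np.argsort(-scores, kind="stable")[:k]


# Toy example: 2 topics, 3 words. Words 0 and 2 are topic-specific,
# while word 1 is equally likely under both topics.
phi = np.array([[0.8, 0.1, 0.1],
                [0.1, 0.1, 0.8]])
prior = np.array([0.5, 0.5])

scores = lda_sigma_scores(phi, prior)
top = select_features(phi, prior, k=2)
print(scores, top)
```

In this toy setting the word that is equally probable under both topics receives a score of zero, while the two topic-specific words receive equal, larger scores, which is the behavior one would expect from a dispersion-based criterion.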
