Journal of Computer Applications, 2012, Vol. 32, Issue 12: 3335-3338. DOI: 10.3724/SP.J.1087.2012.03335

• Artificial Intelligence •

使用概念描述的中文短文本分类算法

杨天平1,2,朱征宇1,2   

  1. 重庆大学 计算机学院,重庆 400030
  2. 重庆大学 软件工程重庆市重点实验室,重庆 400030
  • Received: 2012-06-10 Revised: 2012-07-27 Online: 2012-12-29 Published: 2012-12-01
  • Corresponding author: YANG Tian-ping
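  • About the authors: YANG Tian-ping (born 1986), male, from Chengdu, Sichuan, master's student; research interests: data mining, text analysis, natural language processing. ZHU Zheng-yu (born 1959), male, from Chongqing, professor, Ph.D.; research interests: Web intelligent retrieval, intelligent transportation, data mining.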
  • Supported by:
    Key Project of the National Key Technology R&D Program (国家科技支撑计划), Ministry of Science and Technology

Algorithm for Chinese short-text classification using concept description

YANG Tian-ping1,2,ZHU Zheng-yu1,2   

  1. School of Computer Science, Chongqing University, Chongqing 400030, China
  2. Software Engineering Chongqing Key Laboratory, Chongqing University, Chongqing 400030, China
  • Received: 2012-06-10 Revised: 2012-07-27 Online: 2012-12-29 Published: 2012-12-01
  • Contact: YANG Tian-ping

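Abstract (translated from the Chinese original): Short texts contain few features, so traditional text classification algorithms perform poorly on them. To address this, a short-text classification algorithm based on concept description is proposed. The method first builds a global semantic concept word list. It then uses this word list to produce concept descriptions of both the short texts to be classified and the training short texts, so that each short text to be classified is combined with the training short texts that have similar concept descriptions to form a long text for prediction; the short texts within the training set are likewise combined with one another to form training long texts. Finally, a traditional long-text classification algorithm performs the classification. Experiments show that the method effectively mines the semantic information implicit in short texts, extends short texts semantically, and improves the accuracy of short-text classification.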

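Key words (translated from the Chinese original): short-text classification, concept description, data mining, machine learning, natural language processing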

Abstract: Traditional classification algorithms perform poorly on short texts because short texts contain few features. To address this problem, an algorithm using concept description is presented. First, a global semantic concept word list is built. Then the test set and the training set are both conceptualized with this word list: each test short text is combined with the training short texts that share a similar concept description to form a test long text, and the short texts within the training set are likewise combined with one another to form training long texts. Finally, the long texts are classified by a traditional long-text classification algorithm. Experiments show that the proposed method effectively mines the implicit semantic information in short texts, expands them semantically, and improves the accuracy of short-text classification.

Key words: short-text classification, concept description, data mining, machine learning, natural language processing
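
The pipeline outlined in the abstract (concept word list, concept description, expansion of short texts into long texts, conventional classification) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and everything concrete in it is an assumption made for illustration: the toy word-to-concept table `CONCEPTS` stands in for the global semantic concept word list, Jaccard overlap with a fixed threshold stands in for the unspecified similarity measure between concept descriptions, a small multinomial Naive Bayes stands in for the "traditional long-text classification algorithm", and the texts are taken to be already segmented into words.

```python
import math
from collections import Counter, defaultdict

# Hypothetical concept word list (word -> concept label); assumed to be given.
CONCEPTS = {
    "goal": "sport", "striker": "sport", "match": "sport", "league": "sport",
    "stock": "finance", "shares": "finance", "market": "finance",
    "bank": "finance", "dividend": "finance",
}

def conceptualize(tokens, concepts=CONCEPTS):
    """Describe a tokenized short text by the set of concepts its words map to."""
    return {concepts[t] for t in tokens if t in concepts}

def jaccard(a, b):
    """Jaccard overlap of two concept sets (0.0 when both are empty)."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def expand(tokens, pool, threshold=0.5, skip=None):
    """Build a long text: the short text plus every pool text whose concept
    description is similar enough. `skip` excludes the text's own index."""
    desc = conceptualize(tokens)
    long_text = list(tokens)
    for i, other in enumerate(pool):
        if i != skip and jaccard(desc, conceptualize(other)) >= threshold:
            long_text.extend(other)
    return long_text

def train_nb(docs, labels):
    """Collect per-class word counts for a multinomial Naive Bayes model."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def predict_nb(model, doc):
    """Return the class with the highest Laplace-smoothed log score.
    Words outside the training vocabulary are ignored."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for w in doc:
            if w in vocab:
                score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

# Toy training set of pre-segmented short texts.
train_texts = [["striker", "goal"], ["league", "match"],
               ["stock", "market"], ["bank", "shares"]]
train_labels = ["sport", "sport", "finance", "finance"]

# Training short texts are combined with each other into training long texts.
train_long = [expand(t, train_texts, skip=i) for i, t in enumerate(train_texts)]
model = train_nb(train_long, train_labels)

# The test short text shares no word with the training set, only a concept.
test_short = ["dividend", "today"]
test_long = expand(test_short, train_texts)
predicted = predict_nb(model, test_long)
print(predicted)  # finance
```

In this toy run the test text has no word in common with any training text, so a bag-of-words classifier applied to it directly would have nothing to score. The concept description ("finance") is what links it to the two finance training texts, whose words then carry the classification. The threshold of 0.5 is arbitrary and would need tuning on real data.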