Journal of Computer Applications ›› 2015, Vol. 35 ›› Issue (3): 792-796.DOI: 10.11772/j.issn.1001-9081.2015.03.792

Short question classification based on semantic extensions

YE Zhonglin1, YANG Yan1, JIA Zhen1, YIN Hongfeng2   

  1. School of Information Science and Technology, Southwest Jiaotong University, Chengdu Sichuan 610031, China;
  2. DOCOMO Innovations Incorporation, Palo Alto CA 94304, USA
  • Received:2014-10-16 Revised:2014-11-18 Online:2015-03-10 Published:2015-03-13

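  • Corresponding author: YANG Yan
  • About the authors: YE Zhonglin (冶忠林, 1989-), male, Hui ethnicity, born in Xining, Qinghai, M.S. candidate; research interest: natural language processing. YANG Yan (杨燕, 1964-), female, born in Hefei, Anhui, professor, doctoral supervisor, Ph.D.; research interests: data mining, computational intelligence, ensemble learning. JIA Zhen (贾真, 1975-), female, born in Kaifeng, Henan, lecturer, Ph.D.; research interests: information extraction, knowledge engineering. YIN Hongfeng (尹红风, 1967-), male, born in Xiayi, Henan, professor, Ph.D.; research interests: semantic search, big data.
  • Supported by:

    National Natural Science Foundation of China (61170111, 61262058)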

Abstract:

Question classification is one of the tasks of a question answering system. User questions are often short and colloquial, especially in voice interaction, so traditional text classification methods perform poorly on them. A short question classification method based on semantic extension was therefore proposed. The method first uses a search engine to extend each short question with additional knowledge, then selects feature words with a topic model, and finally obtains the question's category by computing word similarity. Experimental results show that the proposed method achieves an average F-measure of 0.713 on a set of 1365 real questions, which is higher than that of the Support Vector Machine (SVM), K-Nearest Neighbor (KNN) and maximum entropy methods. The method can therefore help a question answering system improve the accuracy of its question classification.

Key words: topic model, question classification, search engine, question answering system
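
The three stages named in the abstract (search-engine extension, feature word selection, word similarity) can be illustrated with a toy sketch. This is not the authors' implementation, and every name and data item in it is hypothetical. The search engine is replaced by a small in-memory table of snippets, the topic model is replaced by plain word-frequency ranking, and word similarity is replaced by exact matching. The category seed words are also assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical stand-in for a search engine: question -> result snippets.
FAKE_SNIPPETS = {
    "weather tomorrow": [
        "weather forecast tomorrow rain temperature",
        "tomorrow weather forecast sunny temperature high",
    ],
    "play jay chou": [
        "jay chou song music album listen",
        "listen jay chou music song online",
    ],
}

# Hypothetical seed words describing each category.
CATEGORY_SEEDS = {
    "weather": {"forecast", "temperature", "rain"},
    "music": {"song", "music", "album"},
}


def expand(question):
    """Stage 1: extend the short question with retrieved snippet text."""
    words = question.split()
    for snippet in FAKE_SNIPPETS.get(question, []):
        words.extend(snippet.split())
    return words


def select_features(words, k=5):
    """Stage 2 (stand-in): keep the k most frequent words.

    The paper selects feature words with a topic model; frequency
    ranking is used here only to keep the sketch dependency-free.
    """
    counts = Counter(words)
    # Sort by descending count, then alphabetically for determinism.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ranked[:k]


def word_similarity(a, b):
    """Stage 3 (stand-in): exact-match similarity in {0.0, 1.0}.

    A real system would use a graded semantic similarity measure.
    """
    return 1.0 if a == b else 0.0


def classify(question):
    """Score each category by the mean best similarity of the features."""
    features = select_features(expand(question))
    scores = {}
    for category, seeds in CATEGORY_SEEDS.items():
        total = sum(max(word_similarity(f, s) for s in seeds) for f in features)
        scores[category] = total / len(features) if features else 0.0
    return max(scores, key=scores.get)


if __name__ == "__main__":
    for q in FAKE_SNIPPETS:
        print(q, "->", classify(q))
```

The point of the sketch is the data flow: neither toy question contains a category seed word on its own, and the classification only succeeds because the extension stage adds related words before features are selected and compared.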

CLC Number: