Journal of Computer Applications ›› 2012, Vol. 32 ›› Issue (05): 1335-1339.

• Artificial Intelligence •

Text classification model framework based on social annotation quality

LI Jin1,2, ZHANG Hua1, WU Hao-xiong1, XIANG Jun1, GU Xi-wu3

  1. School of Information Engineering, Hubei University for Nationalities, Enshi, Hubei 445000, China
  2. Department of Information Management, Central China Normal University, Wuhan, Hubei 430079, China
  3. School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
  • Received: 2011-11-21 Revised: 2012-01-02 Online: 2012-05-01 Published: 2012-05-01
  • Corresponding author: LI Jin
  • About the authors: LI Jin (born 1973), male, from Enshi, Hubei; associate professor and Ph.D. candidate. Research interests: Internet-based data mining and data management, cloud-oriented Web services and Web service composition, computer network applications and security, and information management. ZHANG Hua (born 1978), male, from Enshi, Hubei; lecturer, M.S. Research interests: network applications. WU Hao-xiong (born 1979), male, from Jianshi, Hubei; engineer. Research interests: network applications and security. XIANG Jun (born 1978), male, from Laifeng, Hubei; lecturer, Ph.D. Research interests: mobile computing, real-time database systems, and software testing. GU Xi-wu (born 1967), male, from Nanchang, Jiangxi; lecturer, Ph.D. Research interests: data mining, information retrieval, and distributed computing.
  • Supported by:

    National Natural Science Foundation of China (61040006); Natural Science Foundation of Hubei Province (2010CDZ027); Science and Technology Project of the Hubei Provincial Department of Education (B20101909)

Abstract: Social annotation is a form of folksonomy in which Web users freely categorize Web resources with text tags. These tags carry rich semantic information about the resources, so applying social annotation to information retrieval can improve retrieval quality. This paper proposes an improved text classification algorithm based on social annotation to improve Web page classification. Because tags are generated freely, without control or expert knowledge, their quality varies widely. The paper therefore first quantifies tag quality using the semantic similarity between documents and the semantic similarity between tags. Low-quality tags are then filtered out, and the remaining higher-quality tags are used to extend the traditional vector space model, so that each Web page is represented by an extended vector whose components are the words in the page and the tags assigned to it. Finally, a support vector machine is used for the classification experiments. The experimental results show that evaluating tag quality, filtering out low-quality tags, and combining document content with the remaining tags improves classification. Compared with traditional classification based on document content alone, the proposed algorithm increases the F1 measure by 6.2% on average.

Key words: social annotation, vector space model, text classification, information retrieval, data mining
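
The pipeline summarized in the abstract (score tags, filter them, extend the document vector) can be illustrated with a minimal Python sketch. The abstract does not give the paper's quality formula, so the score used here is only an assumed stand-in: a tag's quality is taken to be the mean pairwise cosine similarity of the documents that carry it. The function names, the `w:`/`tag:` feature prefixes, the toy documents, and the 0.5 threshold are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def tf_vector(text):
    """Bag-of-words term-frequency vector for a whitespace-tokenized text."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def tag_quality(tag, docs, tags_of):
    """Assumed score (not the paper's): mean pairwise similarity of tagged docs."""
    tagged = [d for d in docs if tag in tags_of[d]]
    pairs = list(combinations(tagged, 2))
    if not pairs:
        return 0.0  # a tag used on one document gives no evidence either way
    return sum(cosine(docs[a], docs[b]) for a, b in pairs) / len(pairs)


def extended_vector(doc_id, docs, tags_of, quality, threshold):
    """Word features plus one feature per tag whose quality passes the filter."""
    vec = {f"w:{term}": count for term, count in docs[doc_id].items()}
    for tag in tags_of[doc_id]:
        if quality[tag] >= threshold:
            vec[f"tag:{tag}"] = 1
    return vec


# Toy corpus: "programming" is applied to similar pages, "toread" is not.
texts = {
    "d1": "python code tutorial",
    "d2": "python code examples",
    "d3": "football match report",
}
tags_of = {
    "d1": {"programming", "toread"},
    "d2": {"programming"},
    "d3": {"toread"},
}

docs = {d: tf_vector(t) for d, t in texts.items()}
all_tags = set().union(*tags_of.values())
quality = {t: tag_quality(t, docs, tags_of) for t in all_tags}
vectors = {
    d: extended_vector(d, docs, tags_of, quality, threshold=0.5) for d in docs
}
```

In this toy corpus the tag "programming" is kept and "toread" is filtered out, so only "programming" appears in the extended vectors. The resulting feature dictionaries could then be passed to any support vector machine implementation, for example scikit-learn's `DictVectorizer` followed by `LinearSVC`; that pairing is a suggestion, not something the abstract specifies.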

CLC number: