Text classification model framework based on social annotation quality

Journal of Computer Applications, 2012, Vol. 32, Issue (05): 1335-1339.

LI Jin¹,², ZHANG Hua¹, WU Hao-xiong¹, XIANG Jun¹, GU Xi-wu³

1. School of Information Engineering, Hubei University for Nationalities, Enshi, Hubei 445000, China
2. Department of Information Management, Central China Normal University, Wuhan, Hubei 430079, China
3. School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China

Received: 2011-11-21 Revised: 2012-01-02 Online: 2012-05-01 Published: 2012-05-01
Corresponding author: LI Jin
About the authors: LI Jin (born 1973), male, from Enshi, Hubei, associate professor, Ph.D. candidate; research interests: Internet-based data mining and data management, cloud-oriented Web services and Web service composition, computer network applications and security, information management. ZHANG Hua (born 1978), male, from Enshi, Hubei, lecturer, M.S.; research interest: network applications. WU Hao-xiong (born 1979), male, from Jianshi, Hubei, engineer; research interests: network applications and security. XIANG Jun (born 1978), male, from Laifeng, Hubei, lecturer, Ph.D.; research interests: mobile computing, real-time database systems, software testing. GU Xi-wu (born 1967), male, from Nanchang, Jiangxi, lecturer, Ph.D.; research interests: data mining, information retrieval, distributed computing.
Supported by:
National Natural Science Foundation of China (61040006); Natural Science Foundation of Hubei Province (2010CDZ027); Science and Technology Project of the Hubei Provincial Department of Education (B20101909)

Abstract

Social annotation is a form of folksonomy in which Web users freely categorize Web resources with text tags, and it carries rich semantic information about those resources. Applying social annotation to information retrieval can therefore improve retrieval quality. This paper proposes an improved text classification algorithm based on social annotation, with the aim of improving Web page classification. Because social tags are generated arbitrarily, without control or expert knowledge, their quality varies widely. The paper therefore first quantifies tag quality using the semantic similarity between documents and the semantic similarity between tags. Low-quality tags are then filtered out, and the remaining higher-quality tags are used to extend the traditional vector space model, so that each Web page is represented by an extended vector whose components are the words in the page and the tags assigned to it. A support vector machine is used for the classification experiments. The experimental results show that assessing tag quality, filtering out low-quality tags, and combining document content with the retained tags improves classification. Compared with traditional classification based on document content alone, the F1 measure increases by 6.2% on average.

Key words: social annotation, vector space model, text classification, information retrieval, data mining
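The pipeline outlined in the abstract (score each tag, drop low-quality tags, append the surviving tags to the word vector) can be illustrated with a minimal sketch. The abstract does not give the quality formula, so the score used here is an assumption: a tag's quality on a page is taken to be the mean cosine similarity between that page and the other pages carrying the same tag. The function names, the toy corpus, and the 0.3 threshold are all hypothetical, and the support vector machine step is not shown.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def tag_quality(doc_id, tag, word_vecs, doc_tags):
    """Assumed proxy score, not the paper's formula: mean content similarity
    between this document and the other documents carrying the same tag."""
    peers = [d for d, tags in doc_tags.items() if d != doc_id and tag in tags]
    if not peers:
        return 0.0
    return sum(cosine(word_vecs[doc_id], word_vecs[p]) for p in peers) / len(peers)


def extended_vector(doc_id, word_vecs, doc_tags, threshold=0.3):
    """Word features plus the tags whose quality reaches the threshold."""
    vec = {("word", t): float(w) for t, w in word_vecs[doc_id].items()}
    for tag in doc_tags[doc_id]:
        if tag_quality(doc_id, tag, word_vecs, doc_tags) >= threshold:
            vec[("tag", tag)] = 1.0
    return vec


# Toy corpus (hypothetical): "cool" is a noisy tag shared by unrelated pages.
docs = {
    "d1": "python code tutorial code",
    "d2": "python code examples",
    "d3": "chocolate cake recipe",
}
doc_tags = {
    "d1": {"programming", "cool"},
    "d2": {"programming"},
    "d3": {"cooking", "cool"},
}
word_vecs = {d: Counter(text.split()) for d, text in docs.items()}
vec = extended_vector("d1", word_vecs, doc_tags)
```

In this toy corpus the tag "programming" is kept for d1 because the only other page with that tag has similar content, while "cool" is dropped because the page sharing it has no words in common with d1. The resulting dictionaries could then be vectorized and passed to any SVM implementation.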

LI Jin, ZHANG Hua, WU Hao-xiong, XIANG Jun, GU Xi-wu. Text classification model framework based on social annotation quality[J]. Journal of Computer Applications, 2012, 32(05): 1335-1339.
