Journal of Computer Applications ›› 2013, Vol. 33 ›› Issue (12): 3559-3562.

• Artificial Intelligence •

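Microblog event detection and tracking based on representative-point incremental hierarchical density clustering in a cloud computing environment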

FENG Yong1,2, HAN Nan1,2, JIA Dongfeng1,2

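  1. Key Laboratory of Dependable Service Computing in Cyber Physical Society, Ministry of Education (Chongqing University), Chongqing 400044;
    2. College of Computer Science, Chongqing University, Chongqing 400044
  • Received: 2013-06-14 Revised: 2013-08-13 Published: 2013-12-01 Online: 2013-12-31
  • Corresponding author: FENG Yong
  • About the authors: FENG Yong (born 1977), male, from Pingdu, Shandong; associate professor, Ph.D.; main research interests: cloud computing, data mining;
    HAN Nan (born 1988), female, from Shangqiu, Henan; master's student; main research interest: data mining;
    JIA Dongfeng (born 1987), male, from Shangqiu, Henan; master's student; main research interest: cloud computing.
  • Supported by:
    National Natural Science Foundation of China; National Key Technology R&D Program; Fundamental Research Funds for the Central Universities; Key Project of Chongqing Higher Education Teaching Reform Research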

Microblog Events Detection and Tracking based on RIHDBSCAN using Cloud Framework

FENG Yong1,2,HAN Nan1,2,JIA Dongfeng1,2   

  1. College of Computer Science, Chongqing University, Chongqing 400044, China
    2. Key Laboratory of Dependable Service Computing in Cyber Physical Society, Ministry of Education (Chongqing University), Chongqing 400044, China
  • Received:2013-06-14 Revised:2013-08-13 Online:2013-12-31 Published:2013-12-01
  • Contact: FENG Yong

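Abstract: To extract news events from the large volume of real-time information generated by microblog service platforms, a complete microblog event detection and tracking algorithm for the cloud computing environment is proposed. First, a new weight calculation method based on the repost count and comment count of microblogs is used to represent microblog text as a vector space model. Then the Representative-point-based Incremental Hierarchical Density clustering (RIHDBSCAN) algorithm is used to extract keywords, which ultimately enables the detection and tracking of news events. Because a single node cannot process massive microblog data quickly and efficiently, the algorithm is deployed on the cloud computing platform Hadoop. Experiments on real data obtained from the Sina Weibo platform show that the proposed weight calculation method, compared with …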

Keywords: microblog, event detection, density-based clustering algorithm, cloud computing, Hadoop platform, representative point

Abstract: To extract events from the large volume of short posts on microblogging services, a complete event detection and tracking algorithm using a cloud framework is proposed. First, posts are represented in a Vector Space Model (VSM) whose term weights are based on the number of forwards (reposts) and comments each post receives. Keywords are then extracted using RIHDBSCAN (Incremental Hierarchical DBSCAN based on Representative posts) to detect and track events. Because a single node cannot quickly and efficiently handle such a large amount of data, the algorithm is deployed on Hadoop, a cloud computing platform. Experiments on real microblog data collected from the Sina microblogging platform show that the proposed method achieves higher performance than TF-IDF (Term Frequency-Inverse Document Frequency) and UF-ITUF (User Frequency-Inverse Thread User Frequency), and that the cloud framework improves processing speed. The approach is therefore suitable for data analysis and mining on huge datasets.

Key words: microblog, event detection, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), cloud computing, Hadoop platform, representative post
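
The abstract does not state the weighting formula, so the following is only an illustrative sketch of how forward and comment counts could enter a vector space model. The engagement factor `1 + log(1 + reposts + comments)`, the smoothed IDF, and the function names `engagement_weighted_vectors` and `top_terms` are all assumptions made for this example and are not the authors' method.

```python
import math
from collections import Counter


def engagement_weighted_vectors(posts):
    """Build one sparse term-weight vector per post.

    posts: list of (tokens, reposts, comments) tuples.
    The weight is TF * smoothed IDF * engagement factor, where the
    engagement factor is a HYPOTHETICAL choice, not the paper's formula.
    """
    n = len(posts)
    # Document frequency: in how many posts each term appears.
    df = Counter()
    for tokens, _, _ in posts:
        df.update(set(tokens))

    vectors = []
    for tokens, reposts, comments in posts:
        if not tokens:
            vectors.append({})
            continue
        tf = Counter(tokens)
        # Assumed engagement factor: grows slowly with reposts + comments.
        boost = 1.0 + math.log1p(reposts + comments)
        vec = {}
        for term, count in tf.items():
            idf = math.log((1 + n) / (1 + df[term])) + 1.0  # smoothed IDF
            vec[term] = (count / len(tokens)) * idf * boost
        vectors.append(vec)
    return vectors


def top_terms(vectors, k):
    """Sum term weights over all posts and return the k highest-scoring terms."""
    totals = Counter()
    for vec in vectors:
        for term, weight in vec.items():
            totals[term] += weight
    # Sort by descending score, then alphabetically for a stable order.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


if __name__ == "__main__":
    sample = [
        (["earthquake", "sichuan"], 100, 50),  # heavily forwarded post
        (["earthquake", "sichuan"], 0, 0),     # same text, no engagement
        (["concert", "tickets"], 3, 1),
    ]
    vecs = engagement_weighted_vectors(sample)
    print(top_terms(vecs, 2))
```

In this sketch, terms from heavily forwarded and commented posts receive larger weights than the same terms in posts with no engagement, which is the general idea the abstract describes; the clustering step (RIHDBSCAN) and the Hadoop deployment are not modelled here.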
