
Application of a Distributed KNN Algorithm to WeChat Subscription Classification

Xiao Bin (肖斌)1, Wang Jinyang (王锦阳)2, Ren Qiqiang (任启强)2

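  1. Southwest Petroleum University
  2. School of Computer Science, Southwest Petroleum University, No. 8 Xindu Avenue, Xindu District, Chengdu, Sichuan Province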
  • Received: 2016-11-21 Revised: 2016-12-22 Published: 2016-12-22
  • Corresponding author: Wang Jinyang (王锦阳)

Application of distributed KNN algorithm in WeChat subscription classification

  • Received: 2016-11-21 Revised: 2016-12-22 Online: 2016-12-22

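Abstract (translated from the Chinese): The volume of WeChat subscription data has grown sharply, while people who carry out activities on WeChat obtain useful information from it inefficiently. To address this, a method is proposed for organizing WeChat subscription information and for classifying and labeling it quickly in parallel. First, after introducing the practical applications of WeChat subscriptions, the method takes the classic KNN classification algorithm as its basis, and the efficiency shortcomings of single-machine KNN are tested and analyzed. Then a KNN algorithm based on the MapReduce model is implemented on the Hadoop platform, the efficiency of the single-machine and distributed versions is compared, and the value of K is tuned. In the experiments the training sample set is specified manually, and text similarity is determined in steps: word segmentation, feature word extraction, weight calculation, and calculation of the angle between the test vector and the training vectors. On the basis of 24 categories, a classification experiment on 10 million subscription records gave each subscription a single label or multiple labels. The classification accuracy after optimization reaches 82%, and subscriptions related to daily life account for more than 70% of the total. The study shows that when the classification results are used to deliver information to specific groups, the conversion rate improves, and that the distributed KNN algorithm is more efficient and more robust than the single-machine algorithm for processing WeChat subscription data.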
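
The last similarity step, the angle between a test vector and a training vector, is usually expressed as a cosine. As a sketch, let $w_{d,i}$ and $w_{t,i}$ be the weights of feature word $i$ in the test document $d$ and the training document $t$, over $n$ feature words (these symbols are introduced here for illustration and are not taken from the paper):

```latex
\cos\theta(d, t) =
  \frac{\sum_{i=1}^{n} w_{d,i}\, w_{t,i}}
       {\sqrt{\sum_{i=1}^{n} w_{d,i}^{2}}\;\sqrt{\sum_{i=1}^{n} w_{t,i}^{2}}}
```

With non-negative weights the value lies between 0 and 1, and a larger value means a smaller angle and therefore more similar texts. The abstract does not say which weighting scheme is used; TF-IDF is one common choice and would be an assumption here.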

Keywords (translated from the Chinese): WeChat subscription, Hadoop platform, MapReduce model, KNN, classification

Abstract: As WeChat subscription data grows rapidly, people who run commercial activities on WeChat extract valuable information from it inefficiently. To resolve the issue, a method for classifying and labeling WeChat subscription data in parallel is proposed. First, after the practical applications of WeChat subscriptions are introduced, the efficiency shortcomings of the single-node KNN classification algorithm are analyzed. Then a distributed KNN algorithm is implemented on the Hadoop platform using the MapReduce model, the efficiency of the stand-alone and distributed algorithms is compared, and the value of K is tuned. In the experiments the training sample set is specified in advance, and the text similarity between a test sample and a training sample is computed in four steps: word segmentation, feature word extraction, weight calculation, and cosine coefficient calculation. On the basis of 24 categories, ten million real records were classified and every WeChat subscription was given a single label or multiple labels. The classification accuracy after optimization reaches 82%, and subscriptions related to daily life account for more than 70% of the total. The research shows that the conversion rate of information improves when the classification results are used, and that the distributed KNN algorithm is more efficient and more robust than the single-node algorithm for WeChat subscription data.
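
The abstract does not give the implementation, so the following is only a minimal in-process sketch of how KNN can be split into a map step and a reduce step. It is plain Python, not Hadoop code. The function names, the sample data, the dict-based sparse vectors, and the majority vote that returns a single label are all assumptions made for illustration.

```python
import heapq
import math
from collections import Counter


def cosine(a, b):
    """Cosine of the angle between two sparse vectors stored as term->weight dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def map_partition(test_vec, partition, k):
    """Map step: score one partition of the training set and keep its local top k."""
    scored = ((cosine(test_vec, vec), label) for vec, label in partition)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


def reduce_topk(partials, k):
    """Reduce step: merge the local top-k lists, keep the global top k, then vote."""
    merged = [pair for part in partials for pair in part]
    nearest = heapq.nlargest(k, merged, key=lambda pair: pair[0])
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def classify(test_vec, partitions, k=3):
    """Run the map step over every partition, then a single reduce step."""
    partials = [map_partition(test_vec, part, k) for part in partitions]
    return reduce_topk(partials, k)


# Hypothetical, hand-made training data split into two partitions.
PARTITIONS = [
    [({"food": 1.0, "recipe": 1.0}, "life"),
     ({"stock": 1.0, "fund": 1.0}, "finance")],
    [({"food": 1.0, "travel": 1.0}, "life"),
     ({"stock": 1.0, "bank": 1.0}, "finance"),
     ({"recipe": 1.0, "travel": 1.0}, "life")],
]

if __name__ == "__main__":
    print(classify({"food": 1.0, "recipe": 1.0, "travel": 0.5}, PARTITIONS, k=3))
```

Each partition only has to return its own k best candidates, because the global k nearest neighbours are always contained in the union of the local top-k lists. This keeps the data passed to the reduce step small regardless of the size of the training set.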

Key words: WeChat subscription, Hadoop platform, MapReduce model, KNN, classification

CLC number: