Journal of Computer Applications, 2018, Vol. 38, Issue (1): 159-164. DOI: 10.11772/j.issn.1001-9081.2017071660

• Data Science and Technology •

Massive data analysis of power utilization based on improved K-means algorithm and cloud computing

ZHANG Chengchang1, ZHANG Huayu2, LUO Jianchang1, HE Feng1

  1. College of Optoelectronic Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China;
  2. College of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  • Received: 2017-07-04 Revised: 2017-08-21 Online: 2018-01-10 Published: 2018-01-22
  • Corresponding author: ZHANG Huayu
  • About the authors: ZHANG Chengchang (born 1975), male, from Lichuan, Hubei, associate professor, Ph.D.; research interests: energy internet, electric power big data, data mining, cyber-physical systems. ZHANG Huayu (born 1990), male, from Hefei, Anhui, master's student; research interest: data mining. LUO Jianchang (born 1990), male, from Jingzhou, Hubei, master's student; research interests: cyber-physical systems, big data. HE Feng (born 1962), male, from Chongqing, professor; research interests: big data, communication technology.
  • Supported by:
    This work is partially supported by the Technology Foundation of China Electric Power Research Institute (XXB51201603155) and the Technology Foundation of State Grid Economic and Technological Research Institute (15JS191).

Abstract: To address the low mining efficiency and large data volume encountered when mining residential electricity consumption data, an analysis method for massive power utilization data based on cloud computing and an improved K-means algorithm was studied. Because the initial cluster centers and the value of K are difficult to determine in the traditional K-means algorithm, a density-based improved K-means algorithm was proposed. Firstly, the product of three quantities (the sample density, the reciprocal of the average distance between samples within a cluster, and the distance between clusters) was defined as the weight product, and the cluster centers were determined one after another by the maximum weight product method, which improved clustering accuracy. Secondly, the improved algorithm was parallelized on the MapReduce model, which improved clustering efficiency. Finally, a mining experiment on massive power utilization data was carried out using the electricity data of 400 households in a residential community. Taking the household as the unit, four features were extracted for each user: the peak-hour electricity consumption rate, the load rate, the valley load coefficient, and the percentage of electricity used during flat (normal) hours. These formed the feature vector used for clustering, with which similar user types were clustered and the behavioral characteristics of each user type were analyzed. Experimental results on a Hadoop cluster show that the improved K-means algorithm runs stably and reliably and achieves good clustering results.

Key words: power utilization data, cloud computing, improved K-means algorithm, MapReduce model, parallelization
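
The weight-product seeding described in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the abstract does not give the formulas, so every concrete choice here is an assumption. The neighborhood radius is taken as the mean pairwise distance, the density as the number of neighbors within that radius, the compactness term as the reciprocal of the mean distance to those neighbors, and the separation term as the distance to the nearest center already chosen. The function name `weight_product_seeds` is hypothetical.

```python
import numpy as np


def weight_product_seeds(X, k, radius=None):
    """Pick k initial centers by a maximum-weight-product rule (illustrative only)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Pairwise Euclidean distances, shape (n, n).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    if radius is None:
        # Assumption: neighborhood radius = mean pairwise distance.
        radius = D.sum() / (n * (n - 1))
    # Neighbors of each sample inside the radius, excluding the sample itself.
    neighbors = (D <= radius) & ~np.eye(n, dtype=bool)
    # Assumption: density = number of neighbors inside the radius.
    density = neighbors.sum(axis=1)
    # Mean distance from each sample to its neighbors.
    mean_dist = (D * neighbors).sum(axis=1) / np.maximum(density, 1)
    # Reciprocal of that mean distance; isolated samples get 0.
    compactness = np.where(density > 0, 1.0 / np.maximum(mean_dist, 1e-12), 0.0)

    # Assumption: the first center is the densest sample.
    chosen = [int(np.argmax(density))]
    while len(chosen) < k:
        # Distance from every sample to its nearest already-chosen center.
        separation = D[:, chosen].min(axis=1)
        weight = density * compactness * separation
        weight[chosen] = -np.inf  # never pick the same sample twice
        chosen.append(int(np.argmax(weight)))
    return X[chosen]
```

The returned rows could then be passed as initial centers to any standard K-means iteration. In the setting of the abstract, each row of `X` would be one household's four-dimensional feature vector.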

CLC number: