Journal of Computer Applications ›› 2017, Vol. 37 ›› Issue (3): 628-634. DOI: 10.11772/j.issn.1001-9081.2017.03.628

• The 4th CCF Academic Conference on Big Data (CCF BIGDATA2016) •

Analysis of large-scale distributed machine learning systems: a case study on LDA

TANG Lizhe1,2, FENG Dawei1,2, LI Dongsheng1,2, LI Rongchun1,2, LIU Feng1,2   

  1. National Laboratory for Parallel and Distributed Processing (National University of Defense Technology), Changsha, Hunan 410073, China;
    2. College of Computer, National University of Defense Technology, Changsha, Hunan 410073, China
  • Received: 2016-09-21  Revised: 2016-09-30  Online: 2017-03-10  Published: 2017-03-22
  • Corresponding author: FENG Dawei
  • About the authors: TANG Lizhe (born 1991), male, from Yongzhou, Hunan, M.S. candidate, whose research interests include parallel computing and machine learning; FENG Dawei (born 1985), male, from Xiangtan, Hunan, Ph.D., assistant research fellow, whose research interests include parallel computing and machine learning; LI Dongsheng (born 1978), male, from Anqing, Anhui, Ph.D., research fellow, whose research interests include parallel computing and machine learning; LI Rongchun (born 1985), male, from Wuwei, Anhui, Ph.D., assistant research fellow, whose research interest is machine learning; LIU Feng (born 1978), male, from Changsha, Hunan, Ph.D., associate research fellow, whose research interest is parallel computing.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61222205).

Abstract: To address the problems of scalability, algorithm convergence and operational efficiency in building large-scale machine learning systems, the challenges posed by large-scale samples, large models and network communication were analyzed, together with the solutions adopted by existing systems. Taking the Latent Dirichlet Allocation (LDA) model as an example, three open-source distributed LDA systems, namely Spark LDA, PLDA+ and LightLDA, were compared in terms of system resource consumption, algorithm convergence performance and scalability, and the differences in their design, implementation and performance were analyzed. The experimental results show that for small sample sets and models, the memory usage of LightLDA and PLDA+ is about half that of Spark LDA, and their convergence speed is 4 to 5 times that of Spark LDA; for large-scale sample sets and models, the total network communication volume and the convergence time of LightLDA are far smaller than those of PLDA+ and Spark LDA, showing good scalability. The "data parallelism + model parallelism" architecture can effectively meet the challenges of large-scale samples and models; the Stale Synchronous Parallel (SSP) parameter synchronization strategy, local model caching and sparse parameter storage can effectively reduce network overhead and improve system efficiency.
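To illustrate the bounded-staleness idea behind the SSP strategy mentioned above, the following is a minimal single-process Python sketch. Every name in it (the SSPClock class, the worker loop, the staleness parameter) is invented here for exposition and is not code from Spark LDA, PLDA+ or LightLDA.

    import threading

    class SSPClock:
        """Tracks per-worker iteration clocks and enforces bounded staleness:
        a worker may run at most `staleness` iterations ahead of the slowest."""
        def __init__(self, num_workers, staleness):
            self.clocks = [0] * num_workers
            self.staleness = staleness
            self.cond = threading.Condition()

        def tick(self, worker_id):
            with self.cond:
                self.clocks[worker_id] += 1
                self.cond.notify_all()  # wake workers that may now proceed
                # Block while this worker is too far ahead of the slowest one.
                while self.clocks[worker_id] - min(self.clocks) > self.staleness:
                    self.cond.wait()

    def worker(worker_id, clock, iterations):
        for _ in range(iterations):
            # ... one sweep over the local data shard, reading parameters
            # that may be up to `staleness` iterations old ...
            clock.tick(worker_id)  # may block to respect the staleness bound

    clock = SSPClock(num_workers=4, staleness=2)
    threads = [threading.Thread(target=worker, args=(i, clock, 10)) for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

With staleness = 0 this scheme degenerates to Bulk Synchronous Parallel; a larger bound lets fast workers proceed with slightly stale parameters instead of stalling at a global barrier, which is the trade-off between parameter freshness and synchronization cost that the abstract refers to.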

Key words: Latent Dirichlet Allocation (LDA), topic model, text clustering, Gibbs sampling, variational Bayes inference, machine learning

CLC number: