Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (7): 2040-2048.DOI: 10.11772/j.issn.1001-9081.2022071130

• The 39th CCF National Database Conference (NDBC 2022) • Previous Articles    

GTC: geo-distributed triangle counting algorithm in graph streams

Chunze CAO1, Delong MA1, Ye YUAN2()   

  1. 1.School of Computer Science and Engineering,Northeastern University,Shenyang Liaoning 110169,China
    2.School of Computer Science and Technology,Beijing Institute of Technology,Beijing 100081,China
  • Received:2022-07-12 Revised:2022-08-28 Accepted:2022-09-05 Online:2023-07-20 Published:2023-07-10
  • Contact: Ye YUAN
  • About author:CAO Chunze, born in 1997, M. S. candidate. His research interests include geo-distributed computing, triangle counting.
    MA Delong, born in 1990, Ph. D. candidate. His research interests include geo-distributed processing and analysis of large graph data.
    YUAN Ye, born in 1981, Ph. D., professor. His research interests include big data management, database theory and system.
  • Supported by:
    National Natural Science Foundation of China(61932004)

跨域环境下图流三角计数算法GTC

曹春泽1, 马德龙1, 袁野2()   

  1. 1.东北大学 计算机科学与工程学院, 沈阳 110169
    2.北京理工大学 计算机学院, 北京 100081
  • 通讯作者: 袁野
  • 作者简介:曹春泽(1997—),男,安徽合肥人,硕士研究生,主要研究方向:跨域计算、三角计数;
    马德龙(1990—),男,辽宁沈阳人,博士研究生,主要研究方向:大图数据的跨域处理与分析;
    袁野(1981—),男,辽宁沈阳人,教授,博士,CCF高级会员,主要研究方向:大数据管理、数据库理论与系统。
  • 基金资助:
    国家自然科学基金资助项目(61932004)

Abstract:

The existing distributed triangle counting algorithms assume that all computing nodes are in the same location, while in reality, the nodes may be located in multiple data centers across continents. Geo-distributed data centers which are connected with wide area networks have characteristics of heterogeneous network bandwidth, high communication cost and uneven distribution, and the existing distributed algorithms cannot be applied to geo-distributed environment. At the same time, the existing research which ignores the temporal locality of the formation of triangles mostly adopts strategies such as random sampling and elimination of edges. Therefore, the triangle counting problem of real graph streams in geo-distributed environment was studied, and a Geo-distributed Triangle Counting (GTC) algorithm was proposed. Firstly, aiming at the problem of too high data transmission caused by the existing edge distribution strategy, a geo-distributed edge distribution strategy was proposed to build a benefit formula combining the time benefit and data benefit of communication and use point-to-point communication to replace broadcast edges. Then, for the triangle repeated counting problem caused by point-to-point communication in geo-distributed environment, a final edge calculation rule was proposed to ensure no counting repetition. Finally, based on the time weighted sampling algorithm, a time-weighted triangle counting algorithm was proposed to use the time locality of the triangle to sample. The GTC was compared with Conditional Counting and Sampling (CoCoS) and Tri-Fly on five real graph streams. The results show that GTC has the communication data size decreased by 17% compared to CoCoS and decreased by 44% compared to Tri-Fly, GTC has the error rate decreased by 53% compared to Tri-Fly and slightly less than CoCoS, and GTC has the running time decreased by 34% compared to Tri-Fly and slightly more than CoCoS. It can be seen that the GTC can reduce the size of communication data effectively while ensuring high accuracy and short algorithm running time.

Key words: geo-distributed, graph stream, triangle counting, approximate computing, sampling

摘要:

现有的分布式三角计数算法假设所有计算节点位于同一地理位置,然而现实中它们可能位于跨洲际的多个数据中心中。跨域分布的数据中心使用广域网连接,具有网络带宽异质、通信费用高昂、分布不均等特点,而现有分布式算法无法适用于跨域环境。同时,现有研究较多采用随机采样、淘汰边等策略,忽略了三角形的形成具有时间局部性的特点。因此,研究了跨域环境中真实图流的三角计数问题并提出跨域三角计数(GTC)算法。首先针对现有边分发策略导致数据传输量过高的问题,提出一种跨域边分发策略,以结合通信的时间收益和数据收益建立收益公式,并使用点对点通信代替广播边;然后对于点对点通信在跨域环境中导致的三角形重复计数问题,提出终边计算规则,以确保无重复计数;最后基于时间加权采样算法提出时间加权三角计数算法,以利用三角形的时间局部性特点采样。在5个图流上把GTC与CoCoS (Conditional Counting and Sampling)、Tri-Fly进行对比的结果表明:GTC在通信数据量上比CoCoS减少了17%,比Tri-Fly减少了44%;在误差率上GTC比Tri-Fly减小了53%,略低于CoCoS;在算法运行时间上GTC比Tri-Fly减少了34%,略高于CoCoS。可见,GTC在保证较高准确率与较短算法运行时间的情况下,能有效减少通信数据量。

关键词: 跨域, 图流, 三角计数, 近似计算, 采样

CLC Number: