Journal of Computer Applications, 2019, Vol. 39, Issue (11): 3184-3190. DOI: 10.11772/j.issn.1001-9081.2019051033

• The 2019 CCF Conference on Artificial Intelligence (CCFAI2019)

Repairing of missing bus arrival data based on DBSCAN algorithm and multi-source data

WANG Cheng1, CUI Ziwei1, DU Zilin1, GAO Yueer2   

  1. College of Computer Science and Technology, Huaqiao University, Xiamen Fujian 361021, China;
    2. School of Architecture, Huaqiao University, Xiamen Fujian 361021, China
  • Received: 2019-05-24 Revised: 2019-07-26 Online: 2019-09-11 Published: 2019-11-10
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China Youth Fund (51608209), the Project of Natural Science Foundation of Fujian Province (2017J01090), the Project of Guiding Plan of Fujian Province (2019H0017), the Project of Quanzhou Science and Technology Plan (2018Z008), the Project of Postgraduate Research Innovation Ability Cultivation of Huaqiao University (17013083017).

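  • Corresponding author: WANG Cheng
  • About the authors: WANG Cheng (born 1984), male, from Xianning, Hubei, Ph.D., associate professor, CCF senior member; research interests: traffic big data, data mining, machine learning. CUI Ziwei (born 1995), female, from Shangqiu, Henan, M.S. candidate, CCF member; research interests: traffic big data, data mining. DU Zilin (born 1997), male, from Binzhou, Shandong; research interests: traffic big data, data mining. GAO Yueer (born 1983), female, from Quanzhou, Fujian, Ph.D., associate professor; research interests: traffic big data, data mining.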

Abstract: Existing methods for repairing missing bus arrival information consider few factors and suffer from low accuracy and poor robustness. To address these problems, a method for repairing missing bus arrival data based on the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm and multi-source data was proposed. Multi-source data, including bus GPS (Global Positioning System) data and IC (Integrated Circuit) card data, were used to repair the missing arrival information. The missing name, longitude and latitude of an arrival station were repaired by association analysis of the complete arrival data and the static line information. The missing arrival time was repaired in three steps. Firstly, for each station with missing data and its nearest station without missing data, the travel times and run sequence in the historical complete arrival data between the two stations were clustered with the DBSCAN algorithm. Secondly, the two runs with complete data adjacent to the studied run were checked: if they belonged to the same cluster, that cluster was kept unchanged; otherwise, the two clusters were merged. Finally, the maximum travel time among the points in the cluster was taken as the missing travel time to determine whether any passenger swiped a card to board the bus at the station. If so, the arrival time was derived from the time at which passengers started swiping their cards; if not, the mean of the maximum and minimum travel times among the points in the cluster was taken as the missing travel time to calculate the arrival time. Taking the bus arrival data of Xiamen as an example, in the repair of the missing station name, longitude and latitude, the clustering method based on GPS data, the maximum probability estimation method and the proposed method all repaired 100.00% of the data. In the repair of the missing arrival time, the mean relative error of the proposed method was 0.0301% and 0.0004% lower than those of the two comparison methods respectively, and its correlation coefficient was 0.005 and 0.0075 higher respectively. The experimental results show that the proposed method effectively improves the accuracy of repairing missing bus arrival data and reduces the impact of the number of missing stations on accuracy.
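
The three arrival-time steps can be illustrated with a minimal Python sketch. It is one reading of the abstract, not the authors' implementation. The function names, the `eps` and `min_pts` values, the unscaled (run index, travel time in seconds) features, the swipe-time window and the use of the earliest swipe as the arrival time are all assumptions made for illustration, and a small pure-Python DBSCAN stands in for whatever implementation the paper used.

```python
from math import hypot

NOISE = -1

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN over 2-D points; returns one label per point (-1 = noise)."""
    labels = [None] * len(points)

    def neighbors(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if hypot(xi - xj, yi - yj) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point: joins, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:     # core point: keep growing the cluster
                queue.extend(nbrs)
        cluster += 1
    return labels

def repair_arrival_time(history, prev_run, next_run, depart_time, swipe_times,
                        eps=15.0, min_pts=3):
    """Estimate a missing arrival time (hypothetical helper).

    history      -- (run_index, travel_time_s) pairs from complete records
    prev_run     -- run index of the adjacent earlier run with complete data
    next_run     -- run index of the adjacent later run with complete data
    depart_time  -- time (s) the studied bus left the nearest non-missing station
    swipe_times  -- IC card swipe times (s) recorded on the studied bus
    """
    # Step 1: cluster historical travel times together with the run sequence.
    labels = dbscan(history, eps, min_pts)
    label_of = {run: lab for (run, _), lab in zip(history, labels)}

    # Step 2: same cluster -> keep it; different clusters -> use their union.
    wanted = {label_of[prev_run], label_of[next_run]} - {NOISE}
    times = [t for (_, t), lab in zip(history, labels) if lab in wanted]
    if not times:                        # assumption: fall back to all history
        times = [t for _, t in history]
    t_min, t_max = min(times), max(times)

    # Step 3: look for swipes before the latest plausible arrival (assumed
    # window); otherwise use the mean of the extreme travel times.
    in_window = [s for s in swipe_times if depart_time < s <= depart_time + t_max]
    if in_window:
        return min(in_window)
    return depart_time + (t_min + t_max) / 2

# Invented data: runs 1-4 are off-peak, runs 6-9 are peak, run 5 is missing.
history = [(1, 300), (2, 305), (3, 298), (4, 310),
           (6, 420), (7, 425), (8, 415), (9, 430)]
print(repair_arrival_time(history, 4, 6, 1000, []))            # 1364.0
print(repair_arrival_time(history, 4, 6, 1000, [1350, 1360]))  # 1350
```

In the example, runs 4 and 6 fall into different clusters, so the two clusters are merged and the estimate without swipes is the departure time plus the mean of 298 s and 430 s. In practice the two features would likely need scaling before clustering, since run index and travel time are in different units.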

Key words: repair of missing bus arrival data, Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm, longitude and latitude of arrival station, arrival time, multi-source data


