Journal of Computer Applications, 2020, Vol. 40, Issue (1): 90-95. DOI: 10.11772/j.issn.1001-9081.2019061050

• Data science and technology •

Discovery of functional dependencies in university data based on affinity propagation clustering and TANE algorithms

HUANG Yongxin, TANG Xuefei   

  1. School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu Sichuan 610054, China
  • Received: 2019-06-21 Revised: 2019-09-05 Online: 2020-01-10 Published: 2019-10-10
  • Supported by:
    This work is partially supported by the National Key Research and Development Program of China (2017YFB1401303) and the Sichuan Science and Technology Program (2017GZ0192).

基于近邻传播聚类和TANE算法的高校数据中函数依赖的发现

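  • Corresponding author: TANG Xuefei (唐雪飞)
  • About the authors: HUANG Yongxin (黄永鑫, born 1994), female, from Zigong, Sichuan, is a master's student whose research interests include cloud computing, software technology, and data mining. TANG Xuefei (唐雪飞, born 1964), male, from Chengdu, Sichuan, is an associate professor with a Ph.D. whose research interests include artificial intelligence, big data, and middleware.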

Abstract: To address the missing values in datasets and the small number and low accuracy of the functional dependencies discovered during data quality inspection at universities, a functional dependency discovery method for university data that combines Affinity Propagation (AP) clustering with the TANE algorithm (APTANE) was proposed. First, column profiling was performed on the Chinese-language fields in the dataset, and each Chinese field value was represented by a corresponding numerical value. Then, the AP clustering algorithm was used to fill in the missing values in the dataset. Finally, the TANE algorithm was used to automatically discover the non-trivial, minimal functional dependencies in the processed dataset. The experimental results show that, after a real university dataset was repaired with the AP clustering algorithm, the number of functional dependencies found increased to 80 compared with applying the automatic functional dependency discovery algorithm directly. The functional dependencies found after missing-value filling also represent the relationships between fields more accurately, which reduces the workload of domain experts and improves the quality of the data held by universities.
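The first and last steps of the pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the sample table, the column order (college, major, degree level), and the helper names `encode_column`, `partition_size`, `holds`, and `minimal_fds` are all hypothetical. The search is a brute-force levelwise enumeration that borrows only TANE's partition-based validity test; it omits TANE's candidate pruning, and the AP-based missing-value filling step is not shown.

```python
from itertools import combinations

def encode_column(values):
    """Map each distinct value to an integer code in order of first appearance."""
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in values]

def partition_size(rows, attrs):
    """Number of equivalence classes when rows are grouped by the values of attrs."""
    return len({tuple(row[a] for a in attrs) for row in rows})

def holds(rows, lhs, rhs):
    """X -> A holds exactly when adding A to X splits no equivalence class of X."""
    return partition_size(rows, lhs) == partition_size(rows, lhs + (rhs,))

def minimal_fds(rows, n_attrs):
    """Brute-force levelwise search for non-trivial, minimal FDs X -> A."""
    found = []
    for rhs in range(n_attrs):
        # Excluding rhs from the left-hand side keeps every FD non-trivial.
        others = [a for a in range(n_attrs) if a != rhs]
        minimal_lhs = []
        for size in range(0, len(others) + 1):
            for lhs in combinations(others, size):
                # A superset of an already valid left-hand side is not minimal.
                if any(set(m) <= set(lhs) for m in minimal_lhs):
                    continue
                if holds(rows, lhs, rhs):
                    minimal_lhs.append(lhs)
                    found.append((lhs, rhs))
    return found

# Hypothetical sample records: (college, major, degree level).
raw_rows = [
    ("计算机学院", "软件工程", "本科"),
    ("计算机学院", "软件工程", "本科"),
    ("计算机学院", "网络工程", "本科"),
    ("数学学院", "统计学", "硕士"),
    ("数学学院", "应用数学", "本科"),
]

# Encode column by column, then transpose back into rows of integer codes.
encoded_columns = [encode_column(col) for col in zip(*raw_rows)]
rows = list(zip(*encoded_columns))

fds = minimal_fds(rows, n_attrs=3)
for lhs, rhs in fds:
    print(f"{lhs} -> {rhs}")
```

On this toy table the search reports that column 1 (major) determines column 0 (college) and column 2 (degree level). A record with a missing value would first have to be filled or excluded, because an empty cell would otherwise be encoded as one more distinct value and could mask or fabricate a dependency.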

Key words: university informatization, data quality, Affinity Propagation (AP) clustering algorithm, functional dependency, TANE
