Discovery of functional dependencies in university data based on affinity propagation clustering and TANE algorithms

doi:10.11772/j.issn.1001-9081.2019061050

Journal of Computer Applications, 2020, Vol. 40, Issue 1: 90-95. DOI: 10.11772/j.issn.1001-9081.2019061050

Section: Data science and technology


HUANG Yongxin, TANG Xuefei

School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu Sichuan 610054, China

Received: 2019-06-21; Revised: 2019-09-05; Online: 2020-01-10; Published: 2019-10-10
Supported by:
This work is partially supported by the National Key Research and Development Program of China (2017YFB1401303) and the Sichuan Science and Technology Program (2017GZ0192).

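Corresponding author: TANG Xuefei
About the authors: HUANG Yongxin (born 1994), female, from Zigong, Sichuan, M.S. candidate; main research interests: cloud computing, software technology, data mining. TANG Xuefei (born 1964), male, from Chengdu, Sichuan, associate professor, Ph.D.; main research interests: artificial intelligence, big data, middleware.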

Abstract

Abstract: To address the missing values in datasets and the small number and low accuracy of the functional dependencies discovered during actual data quality inspection in universities, a functional dependency discovery method for university data that combines Affinity Propagation (AP) clustering with the TANE algorithm (APTANE) was proposed. First, column profiling was performed on the Chinese fields in the dataset, and the Chinese field values were represented by corresponding numerical values. Then, the AP clustering algorithm was used to fill in the missing values in the dataset. Finally, the TANE algorithm was used to automatically discover the non-trivial, minimal functional dependencies in the processed dataset. Experimental results show that, after the real university dataset was repaired with the AP clustering algorithm, the number of functional dependencies discovered increased to 80, compared with applying the automatic functional dependency discovery algorithm directly. The functional dependencies discovered after missing-value filling also represent the relationships between fields more accurately, which reduces the workload of domain experts and improves the quality of the data held by universities.

Key words: university informatization, data quality, Affinity Propagation (AP) clustering algorithm, functional dependency, TANE
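To make the first and last steps of the abstract concrete, the following minimal Python sketch encodes text field values as integers and then searches for non-trivial, minimal functional dependencies. It is an illustration under stated assumptions, not the authors' APTANE implementation: the function names (`encode_column`, `holds`, `discover_fds`), the column names, and the toy records are all hypothetical; the search is a plain level-wise enumeration without the partition refinement and pruning that TANE uses; dependencies with an empty left-hand side are not considered; and the AP clustering step for filling missing values is omitted, so the sketch assumes a table with no missing values.

```python
from itertools import combinations


def encode_column(values):
    """Map each distinct field value to an integer code, in order of first appearance."""
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in values]


def holds(rows, lhs, rhs):
    """Return True if the columns in `lhs` functionally determine column `rhs`."""
    seen = {}
    for row in rows:
        key = tuple(row[i] for i in lhs)
        # Two rows that agree on lhs must also agree on rhs.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def discover_fds(rows, n_cols):
    """Enumerate minimal, non-trivial FDs (lhs -> rhs) level by level.

    Simplified stand-in for TANE: every candidate is checked by a full scan.
    """
    found = []
    for rhs in range(n_cols):
        others = [c for c in range(n_cols) if c != rhs]  # non-trivial: rhs not in lhs
        minimal_lhs = []
        for size in range(1, len(others) + 1):
            for lhs in combinations(others, size):
                # Skip supersets of an lhs that already determines rhs (not minimal).
                if any(set(m) <= set(lhs) for m in minimal_lhs):
                    continue
                if holds(rows, lhs, rhs):
                    minimal_lhs.append(lhs)
        found.extend((lhs, rhs) for lhs in minimal_lhs)
    return found


# Hypothetical toy records: (college, major, degree).
names = ["college", "major", "degree"]
records = [
    ("Software", "SE", "Master"),
    ("Software", "SE", "Bachelor"),
    ("Software", "DS", "Master"),
    ("Math", "Stats", "Bachelor"),
    ("Math", "Stats", "Master"),
]

# Encode each column separately, then rebuild the rows from the integer codes.
encoded_columns = [encode_column(col) for col in zip(*records)]
rows = list(zip(*encoded_columns))

fds = discover_fds(rows, len(names))
for lhs, rhs in fds:
    print(", ".join(names[i] for i in lhs), "->", names[rhs])
# prints: major -> college
```

In this toy table only `major -> college` holds; the candidate `major, degree -> college` also holds but is skipped because it is not minimal.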


CLC Number: TP311.1

HUANG Yongxin, TANG Xuefei. Discovery of functional dependencies in university data based on affinity propagation clustering and TANE algorithms[J]. Journal of Computer Applications, 2020, 40(1): 90-95.


