Journal of Computer Applications ›› 2017, Vol. 37 ›› Issue (9): 2684-2688. DOI: 10.11772/j.issn.1001-9081.2017.09.2684

• Data Science and Technology •

Not-temporal attribute correlation model to generate table data realistically

ZHANG Rui1,2, XIAO Ruliang1,2, NI Youcong1,2, DU Xin1,2

  1. Faculty of Software, Fujian Normal University, Fuzhou Fujian 350117, China;
  2. Fujian Provincial Engineering Research Center of Public Service Big Data Mining and Application, Fuzhou Fujian 350117, China
  • Received: 2017-03-29 Revised: 2017-05-16 Online: 2017-09-10 Published: 2017-09-13
  • Corresponding author: XIAO Ruliang, xiaoruliang@163.com
  • About the authors: ZHANG Rui (1992-), male, born in Xiaogan, Hubei, M.S. candidate; research interest: big data software. XIAO Ruliang (1966-), male, born in Loudi, Hunan, Ph.D., professor, CCF senior member; research interests: big data software, Web intelligent recommender systems, software engineering, system virtualization. NI Youcong (1976-), male, born in Hefei, Anhui, Ph.D., associate professor; research interests: software architecture, mobile cloud computing. DU Xin (1979-), female, born in Shihezi, Xinjiang, Ph.D., associate professor; research interests: intelligent computing, computational complexity, search-based software engineering.
  • Supported by:
    This work is partially supported by the Major Project of the Fujian Provincial Science and Technology Plan (2016H6007) and the Fuzhou City School Cooperation Project (2016-G-40).

Abstract: To address the difficulty of correlating attributes when simulating tabular data, an H model is proposed that characterizes the correlations among non-temporal attributes in tabular data. First, the key attributes of the evaluating subject and the evaluated subject are extracted from the data set, and twofold frequency statistics yield four relationship pairs over these key attributes. Then, the Maximum Information Coefficient (MIC) of each relationship pair is calculated to assess its correlation, and each pair is fitted with the Stretched Exponential (SE) distribution. Finally, the data scales of the evaluating and evaluated subjects are set, the activity of the evaluating subjects and the popularity of the evaluated subjects are calculated from the fitted relationships, and the two are linked by requiring the sum of activity to equal the sum of popularity, which gives the H model of non-temporal attribute correlation. Experimental results show that the H model can effectively characterize the correlations among non-temporal attributes in real data sets.

Key words: data simulation, correlation, Maximum Information Coefficient (MIC), Stretched Exponential (SE) distribution, attribute correlation
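
The pipeline summarized in the abstract can be sketched in Python under clearly stated assumptions. The (user, item) record layout, the function names `twofold_counts`, `fit_se_rank`, `se_values` and `match_totals`, the rank-ordering form y_i^c = b - a ln(i) of the SE distribution, the grid search over c, and the proportional rescaling used to equalize the two totals are all illustrative choices rather than details taken from the paper. The MIC step is omitted because it needs a dedicated implementation.

```python
import math
from collections import Counter


def twofold_counts(records):
    """First count ratings per user and per item, then count how many
    users (items) share each activity (popularity) value.
    `records` is an iterable of (user, item) pairs (assumed layout)."""
    records = list(records)
    activity = Counter(user for user, _ in records)     # ratings given per user
    popularity = Counter(item for _, item in records)   # ratings received per item
    users_per_activity = Counter(activity.values())     # second-level count
    items_per_popularity = Counter(popularity.values())
    return activity, popularity, users_per_activity, items_per_popularity


def fit_se_rank(values, c_grid=None):
    """Fit y_i**c = b - a * ln(i) to values sorted in descending order,
    where i is the 1-based rank. For each candidate c the fit is an
    ordinary least-squares line; the c with the highest R^2 wins."""
    ys = sorted(values, reverse=True)
    xs = [math.log(i) for i in range(1, len(ys) + 1)]
    c_grid = c_grid or [k / 100 for k in range(5, 101, 5)]
    best = None
    for c in c_grid:
        ts = [y ** c for y in ys]
        mx, mt = sum(xs) / len(xs), sum(ts) / len(ts)
        sxx = sum((x - mx) ** 2 for x in xs)
        sxt = sum((x - mx) * (t - mt) for x, t in zip(xs, ts))
        stt = sum((t - mt) ** 2 for t in ts)
        if sxx == 0 or stt == 0:
            continue  # degenerate input, no line can be fitted
        slope = sxt / sxx
        r2 = sxt * sxt / (sxx * stt)
        if best is None or r2 > best[0]:
            best = (r2, c, -slope, mt - slope * mx)
    if best is None:
        raise ValueError("need at least two distinct values to fit")
    _, c, a, b = best
    return c, a, b


def se_values(n, c, a, b):
    """Rank-ordered values y_i = (b - a * ln(i)) ** (1 / c) for i = 1..n,
    with the base clipped at zero."""
    return [max(b - a * math.log(i), 0.0) ** (1 / c) for i in range(1, n + 1)]


def match_totals(activities, popularities):
    """Rescale popularities so that their total equals the total activity
    (each rating is given by one user and received by one item)."""
    scale = sum(activities) / sum(popularities)
    return [p * scale for p in popularities]
```

In this sketch, the parameters fitted on a real data set would be passed to `se_values` with the chosen numbers of users and items, and `match_totals` would then enforce the equal-sum constraint. How the paper adapts the fitted parameters to a new data scale is not stated in the abstract, so that step is left to the caller.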

CLC Number: