Journal of Computer Applications ›› 2017, Vol. 37 ›› Issue (11): 3048-3052. DOI: 10.11772/j.issn.1001-9081.2017.11.3048

• The 16th China Conference on Machine Learning (CCML 2017) •

Relative importance index of dummy variables in regression model

LI Haichao1,2, WANG Kaijun1,2, HU Miao1,2, CHEN Lifei1,2

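  1. College of Mathematics and Informatics, Fujian Normal University, Fuzhou 350007;
    2. Fujian Provincial Key Laboratory of Network Security and Cryptology (Fujian Normal University), Fuzhou 350007
  • Received: 2017-05-16 Revised: 2017-06-05 Online: 2017-11-10 Published: 2017-11-11
  • Corresponding author: LI Haichao
  • About the authors: LI Haichao (1990-), male, born in Linwu, Hunan, M.S. candidate; research interests: machine learning, financial data mining. WANG Kaijun (1965-), male, born in Fuzhou, Fujian, associate professor, Ph.D.; research interests: machine learning, intelligent learning and reasoning, data mining, pattern recognition. HU Miao (1994-), male, born in Taihe, Anhui, M.S. candidate; research interests: machine learning, data mining. CHEN Lifei (1972-), male, born in Fuzhou, Fujian, professor, doctoral supervisor, Ph.D.; research interests: statistical machine learning, data mining, pattern recognition.
  • Funding:
    National Natural Science Foundation of China (61672157); Project of Network and Information Security Key Theory and Technological Innovation Team in Fujian Normal University (IRTL1207).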

Relative importance index of dummy variables in regression model

LI Haichao1,2, WANG Kaijun1,2, HU Miao1,2, CHEN Lifei1,2   

  1. College of Mathematics and Informatics, Fujian Normal University, Fuzhou Fujian 350007, China;
    2. Fujian Province Network Security and Cryptography Laboratory (Fujian Normal University), Fuzhou Fujian 350007, China
  • Received: 2017-05-16 Revised: 2017-06-05 Online: 2017-11-10 Published: 2017-11-11
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61672157) and the Project of Network and Information Security Key Theory and Technological Innovation Team in Fujian Normal University (IRTL1207).

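Abstract: Describing qualitative attributes in a regression model usually requires introducing dummy variables. For regression equations containing dummy variables, a method is proposed to describe the differing importance of the individual dummy variables in the equation. The method decomposes the regression sum of squares of such an equation into a dummy-variable part and a non-dummy-variable part, computes the share that each part contributes to the regression equation, and defines that share as the relative importance index of each dummy variable in the equation. Experiments on the Lending Club and Prosper online lending datasets, comprising nearly 100,000 loan records, examined how strongly loan purpose affects the loan success rate and how strongly credit grade affects the loan interest rate. The results show that, whereas a traditional regression equation provides only the coefficients of the dummy variables and cannot reveal their importance, the proposed method reveals the differing importance of the different dummy variables, providing an important means of quantitatively analyzing the influence of qualitative independent variables on the dependent variable in a regression equation.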

Keywords: qualitative attribute, regression equation, dummy variable, index

Abstract: To describe qualitative attributes in a regression model, it is usually necessary to introduce dummy variables. For regression equations with dummy variables, a method was proposed to describe the different importance of the different dummy variables in the equation. The regression sum of squares of an equation with dummy variables was decomposed into a dummy-variable part and a non-dummy-variable part, the proportions that the two parts contribute to the regression equation were calculated, and these proportions were taken as the relative importance index of each dummy variable in the regression equation. On the Lending Club and Prosper online lending datasets, with nearly 100 thousand loan records, experiments examined the influence of loan purpose on the borrowing success rate and the influence of credit grade on the borrowing interest rate. The results show that, compared with the traditional regression equation, which provides only the coefficients of the dummy variables and cannot show their importance, the proposed method shows the different importance of the different dummy variables and provides an important means of quantitatively analyzing the degree of influence of qualitative independent variables on the dependent variable in a regression equation.

Key words: qualitative attribute, regression equation, dummy variable, index
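
The abstract does not state the exact decomposition formula, so the following is only a minimal sketch under an assumption: the regression sum of squares is split term by term as coefficient times the cross-product of the centered regressor with the centered fitted values (a covariance-based split similar in spirit to Pratt's measure, and possibly different from the index defined in the paper). The data, the variable names (`amount`, `purpose_1`, `purpose_2`) and the helper `ssr_shares` are hypothetical and chosen only for illustration.

```python
import numpy as np

def ssr_shares(X, y):
    """Split the regression sum of squares (SSR) across the columns of X.

    X: (n, p) regressors WITHOUT an intercept column; y: (n,) response.
    Returns (contrib, shares); contrib sums to SSR, shares sum to 1.
    Assumed split, not necessarily the paper's formula.
    """
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])          # prepend the intercept
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)  # ordinary least squares
    fitted_centered = design @ coef - y.mean()         # y_hat - mean(y)
    X_centered = X - X.mean(axis=0)
    # y_hat - mean(y) = sum_j coef_j * (x_j - mean(x_j)), so these terms add up to SSR.
    contrib = coef[1:] * (X_centered.T @ fitted_centered)
    return contrib, contrib / contrib.sum()

# Hypothetical data: one numeric regressor and a 3-level qualitative attribute
# encoded as two dummy variables (level 0 is the baseline).
rng = np.random.default_rng(0)
n = 500
amount = rng.normal(size=n)
purpose = rng.integers(0, 3, size=n)
d1 = (purpose == 1).astype(float)
d2 = (purpose == 2).astype(float)
y = 1.0 + 0.5 * amount + 2.0 * d1 - 0.3 * d2 + rng.normal(scale=0.5, size=n)

X = np.column_stack([amount, d1, d2])
contrib, shares = ssr_shares(X, y)
for name, share in zip(["amount", "purpose_1", "purpose_2"], shares):
    print(f"{name}: share of SSR = {share:.3f}")
print(f"dummy-variable part: {shares[1:].sum():.3f}")
```

Under this assumed split, an individual share can be negative when a regressor's coefficient and its covariance with the fitted values have opposite signs, so the shares are best read as signed contributions that sum to one.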

CLC number: