Journal of Computer Applications: 364-369. DOI: 10.11772/j.issn.1001-9081.2023121732

• Frontier and comprehensive applications

Diabetes prediction model based on improved TabNet

Shaoqin XU1, Bo PENG1, Wudan LONG1, Danni DING2

  1. School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu Sichuan 611756, China
  2. School of Computer Science, Chengdu University of Information Technology, Chengdu Sichuan 610225, China
  • Received:2023-12-15 Revised:2024-02-13 Accepted:2024-02-26 Online:2025-01-24 Published:2024-12-31
  • Contact: Bo PENG

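  • About the authors: Shaoqin XU (born 1999), male, from Panzhihua, Sichuan; M.S. candidate. Research interests: feature selection, deep learning.
    Bo PENG (born 1980), female, from Chengdu, Sichuan; professor, Ph.D., CCF member. Research interests: computer vision, pattern recognition.
    Wudan LONG (born 1998), female, from Chongqing; M.S. candidate. Research interests: deep learning, object detection.
    Danni DING (born 1999), female, from Songzi, Hubei; M.S. candidate. Research interests: image processing, biological vision.
  • Supported by:
    Cultivation Project of the Sichuan Science and Technology Innovation Seedling Program (MZGC20230077)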

Abstract:

To address the difficulty of prediction caused by feature redundancy and class imbalance (an unbalanced number of samples per category) in diabetes data, an improved TabNet-based diabetes prediction model, dTabNet (dual TabNet), was proposed. Firstly, quantified global contribution indicators for the features were obtained by dTabNet; at the same time, the single-layer fully connected layers in both the feature transformer and the attentive transformer were replaced with a three-layer fully connected structure to enhance the model's representation capacity. Then, a feature selection module was designed to eliminate redundant and irrelevant features from the diabetes dataset, thereby improving learning efficiency. Finally, diabetes was predicted from the feature subset output by the feature selection module. In addition, the Focal loss function was introduced to improve the loss function and address the class imbalance of diabetes data, and the Bayesian optimization algorithm was then applied to tune the hyperparameters of dTabNet. The proposed model was evaluated on a preprocessed real-world diabetes detection dataset. Experimental results show that dTabNet achieves a diabetes prediction accuracy of 90.9% and an Area Under the Curve (AUC) of 94.7%, which are 4.0 and 2.2 percentage points higher, respectively, than those of TabNet. These results indicate that the proposed model is effective for diabetes prediction on data with redundant features and class imbalance.
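
As a rough illustration of the Focal loss mentioned above, the following minimal sketch computes the binary form of the loss, -α_t (1 - p_t)^γ log(p_t), in plain Python. It is not the authors' implementation: the function names are hypothetical, and the defaults γ = 2 and α = 0.25 are the values commonly used in the original Focal loss formulation, not settings reported in this abstract.

```python
import math


def cross_entropy(p, y, eps=1e-12):
    """Standard binary cross-entropy for one sample (p = predicted P(y = 1))."""
    p_t = p if y == 1 else 1.0 - p
    p_t = min(max(p_t, eps), 1.0 - eps)  # clamp so log() stays finite
    return -math.log(p_t)


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss for one sample: -alpha_t * (1 - p_t)**gamma * log(p_t).

    gamma and alpha are assumed defaults, not values taken from the paper.
    """
    # p_t is the probability the model assigns to the true class.
    p_t = p if y == 1 else 1.0 - p
    # alpha_t weights positives by alpha and negatives by (1 - alpha).
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0 - eps)
    # The factor (1 - p_t)**gamma shrinks the loss of well-classified samples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def mean_focal_loss(probs, labels, gamma=2.0, alpha=0.25):
    """Average focal loss over a batch of predicted probabilities and 0/1 labels."""
    losses = [binary_focal_loss(p, y, gamma, alpha) for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)
```

Relative to cross-entropy, a confidently correct sample (p_t = 0.9) is scaled by (1 - 0.9)^2 = 0.01 and a misclassified one (p_t = 0.1) by (1 - 0.1)^2 = 0.81, before the α weighting. Training therefore concentrates on hard examples, which in imbalanced data often belong to the minority class.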

Key words: diabetes prediction, TabNet, feature selection, unbalanced number of categories, tabular data, Bayesian optimization


CLC Number: