Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (4): 1035-1041. DOI: 10.11772/j.issn.1001-9081.2024030366

• Artificial Intelligence •

Self-supervised learning method using minimal prior knowledge

Junyi ZHU1,2, Leilei CHANG1,2, Xiaobin XU1,2, Zhiyong HAO3,4, Haiyue YU4, Jiang JIANG4

  1. China-Austria Belt and Road Joint Laboratory on Artificial Intelligence and Advanced Manufacturing (Hangzhou Dianzi University), Hangzhou, Zhejiang 310018, China
    2. School of Automation, Hangzhou Dianzi University, Hangzhou, Zhejiang 310018, China
    3. School of Finance and Economics, Shenzhen Institute of Information Technology, Shenzhen, Guangdong 518172, China
    4. College of Systems Engineering, National University of Defense Technology, Changsha, Hunan 410073, China
  • Received: 2024-04-02; Revised: 2024-06-20; Accepted: 2024-06-21; Online: 2024-10-11; Published: 2025-04-10
  • Corresponding author: Xiaobin XU
  • About the authors: ZHU Junyi, born in 2000 in Wenzhou, Zhejiang, M.S. candidate. His research interests include machine learning and data processing.
    CHANG Leilei, born in 1985 in Cangzhou, Hebei, Ph.D., associate research fellow. His research interests include machine learning methods for complex system modeling, reasoning, and optimization.
    XU Xiaobin, born in 1980 in Zhengzhou, Henan, Ph.D., professor, CCF member. His research interests include machine learning and fuzzy set theory.
    HAO Zhiyong, born in 1983 in Chifeng, Inner Mongolia, Ph.D., associate professor. His research interests include machine learning and complex system modeling.
    YU Haiyue, born in 1991 in Tangshan, Hebei, Ph.D., lecturer. His research interests include complex systems and machine learning.
    JIANG Jiang, born in 1981 in Tai'an, Shandong, Ph.D., professor. His research interests include complex system modeling, uncertainty reasoning, and risk-based decision making.
  • Supported by:
    National Key Research and Development Program of China (2022YFE0210700); National Natural Science Foundation of China (72471767); Zhejiang Province Public Welfare Research Program (LTGG23F030003); Fundamental Research Funds for the Provincial Universities of Zhejiang (GK239909299001-010)


Abstract:

To compensate for supervised learning's heavy demand for supervision information, a self-supervised learning method based on minimal prior knowledge is proposed. First, unlabeled data are clustered on the basis of prior knowledge about the data, or initial labels are generated for unlabeled data from their distances to the centers of labeled data. Second, the pseudo-labeled data are sampled randomly, and a machine learning method is chosen to build sub-models. Third, the weight and error of each random draw are calculated, the average error over the draws is taken as the data-label degree of each dataset, and an iteration threshold is set from the initial data-label degree. Finally, the data-label degree is compared with the threshold during iteration to decide whether the termination condition has been reached. Experimental results on 10 UCI public datasets show that, compared with unsupervised learning methods such as K-means, supervised learning algorithms such as Support Vector Machine (SVM), and mainstream self-supervised learning methods such as TabNet (Tabular Network), the proposed method still achieves high classification accuracy on imbalanced datasets without using labels, or on balanced datasets with only limited labels.

Key words: minimal prior knowledge, self-supervised learning, machine learning, data-label degree, iteration threshold

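The iterative scheme summarized in the abstract can be sketched in outline. The following is a minimal illustration only, not the authors' implementation: the plain k-means initialization, the nearest-centroid sub-models, the half-sized random draws, and the function names (`initial_labels`, `refine_labels`) are all assumptions, and the paper's per-draw weights are folded into a simple average here.

```python
import numpy as np

def initial_labels(X, n_classes, n_iter=20, rng=None):
    """Initial labels from minimal prior knowledge (here only the class
    count), via a plain k-means loop. The paper alternatively derives
    initial labels from distances to the centers of a few labeled points."""
    rng = np.random.default_rng(rng)
    centers = X[rng.choice(len(X), n_classes, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(n_classes):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels

def refine_labels(X, labels, n_classes, n_submodels=10, max_iter=10, rng=None):
    """Iterative refinement: random draws of the pseudo-labeled data train
    sub-models (nearest-centroid classifiers, an assumption); the average
    sub-model error is the data-label degree, and the initial degree sets
    the iteration threshold used as the termination condition."""
    rng = np.random.default_rng(rng)
    threshold = None
    for _ in range(max_iter):
        errors, preds = [], []
        for _ in range(n_submodels):
            idx = rng.choice(len(X), size=max(n_classes, len(X) // 2), replace=False)
            # Fit a nearest-centroid sub-model on the random draw.
            centers = np.array([X[idx][labels[idx] == k].mean(axis=0)
                                if np.any(labels[idx] == k) else X[idx].mean(axis=0)
                                for k in range(n_classes)])
            pred = np.linalg.norm(X[:, None, :] - centers[None, :, :],
                                  axis=2).argmin(axis=1)
            errors.append(np.mean(pred != labels))  # sub-model error
            preds.append(pred)
        label_degree = float(np.mean(errors))       # data-label degree
        if threshold is None:
            threshold = label_degree                # threshold from initial degree
        elif label_degree <= threshold:
            break                                   # termination condition met
        # Majority vote across sub-models updates the pseudo-labels.
        labels = np.array([np.bincount(col, minlength=n_classes).argmax()
                           for col in np.asarray(preds).T])
    return labels
```

On well-separated data the loop typically terminates after the second pass, once the data-label degree no longer exceeds the threshold fixed by the first pass.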

CLC number: