Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (4): 1035-1041. DOI: 10.11772/j.issn.1001-9081.2024030366

• Artificial Intelligence •

Self-supervised learning method using minimal prior knowledge

Junyi ZHU1,2, Leilei CHANG1,2, Xiaobin XU1,2, Zhiyong HAO3,4, Haiyue YU4, Jiang JIANG4

  1. China-Austria Belt and Road Joint Laboratory on Artificial Intelligence and Advanced Manufacturing (Hangzhou Dianzi University), Hangzhou, Zhejiang 310018, China
    2. School of Automation, Hangzhou Dianzi University, Hangzhou, Zhejiang 310018, China
    3. School of Finance and Economics, Shenzhen Institute of Information Technology, Shenzhen, Guangdong 518172, China
    4. College of Systems Engineering, National University of Defense Technology, Changsha, Hunan 410073, China
  • Received: 2024-04-02; Revised: 2024-06-20; Accepted: 2024-06-21; Online: 2024-10-11; Published: 2025-04-10
  • Corresponding author: Xiaobin XU
  • About the authors: ZHU Junyi, born in 2000 in Wenzhou, Zhejiang, M.S. candidate. His research interests include machine learning and data processing.
    CHANG Leilei, born in 1985 in Cangzhou, Hebei, Ph.D., associate research fellow. His research interests include machine learning methods for complex system modeling, reasoning, and optimization.
    XU Xiaobin, born in 1980 in Zhengzhou, Henan, Ph.D., professor, CCF member. His research interests include machine learning and fuzzy set theory.
    HAO Zhiyong, born in 1983 in Chifeng, Inner Mongolia, Ph.D., associate professor. His research interests include machine learning and complex system modeling.
    YU Haiyue, born in 1991 in Tangshan, Hebei, Ph.D., lecturer. His research interests include complex systems and machine learning.
    JIANG Jiang, born in 1981 in Tai'an, Shandong, Ph.D., professor. His research interests include complex system modeling, uncertainty reasoning, and risk-based decision making.
  • Supported by:
    National Key Research and Development Program of China (2022YFE0210700); National Natural Science Foundation of China (72471767); Zhejiang Province Public Welfare Research Program (LTGG23F030003); Fundamental Research Funds for the Provincial Universities of Zhejiang (GK239909299001-010)


Abstract:

To compensate for supervised learning's heavy demand for supervision information, a self-supervised learning method based on minimal prior knowledge is proposed. First, unlabeled data are clustered on the basis of prior knowledge about the data, or initial labels are generated for unlabeled data from their distances to the centers of labeled data. Second, the pseudo-labeled data are sampled randomly, and a machine learning method is chosen to build sub-models. Third, the weight and error of each random draw are calculated, the average error over the draws is taken as the data-label degree of each dataset, and an iteration threshold is set from the initial data-label degree. Finally, the data-label degree is compared with the threshold during iteration to decide whether the termination condition has been reached. Experimental results on 10 UCI public datasets show that, compared with unsupervised learning methods such as K-means, supervised learning algorithms such as Support Vector Machine (SVM), and mainstream self-supervised learning methods such as TabNet (Tabular Network), the proposed method still achieves high classification accuracy on imbalanced datasets without using labels, or on balanced datasets with only limited labels.

Key words: minimal prior knowledge, self-supervised learning, machine learning, data-label degree, iteration threshold

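The iterative scheme summarized in the abstract can be sketched in outline. The following is a minimal illustration only, not the authors' implementation: the plain k-means initialization, the nearest-centroid sub-models, the half-sized random draws, and the function names (`initial_labels`, `refine_labels`) are all assumptions, and the paper's per-draw weights are folded into a simple average here.

```python
import numpy as np

def initial_labels(X, n_classes, n_iter=20, rng=None):
    """Initial labels from minimal prior knowledge (here only the class
    count), via a plain k-means loop. The paper alternatively derives
    initial labels from distances to the centers of a few labeled points."""
    rng = np.random.default_rng(rng)
    centers = X[rng.choice(len(X), n_classes, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(n_classes):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels

def refine_labels(X, labels, n_classes, n_submodels=10, max_iter=10, rng=None):
    """Iterative refinement: random draws of the pseudo-labeled data train
    sub-models (nearest-centroid classifiers, an assumption); the average
    sub-model error is the data-label degree, and the initial degree sets
    the iteration threshold used as the termination condition."""
    rng = np.random.default_rng(rng)
    threshold = None
    for _ in range(max_iter):
        errors, preds = [], []
        for _ in range(n_submodels):
            idx = rng.choice(len(X), size=max(n_classes, len(X) // 2), replace=False)
            # Fit a nearest-centroid sub-model on the random draw.
            centers = np.array([X[idx][labels[idx] == k].mean(axis=0)
                                if np.any(labels[idx] == k) else X[idx].mean(axis=0)
                                for k in range(n_classes)])
            pred = np.linalg.norm(X[:, None, :] - centers[None, :, :],
                                  axis=2).argmin(axis=1)
            errors.append(np.mean(pred != labels))  # sub-model error
            preds.append(pred)
        label_degree = float(np.mean(errors))       # data-label degree
        if threshold is None:
            threshold = label_degree                # threshold from initial degree
        elif label_degree <= threshold:
            break                                   # termination condition met
        # Majority vote across sub-models updates the pseudo-labels.
        labels = np.array([np.bincount(col, minlength=n_classes).argmax()
                           for col in np.asarray(preds).T])
    return labels
```

On well-separated data the loop typically terminates after the second pass, once the data-label degree no longer exceeds the threshold fixed by the first pass.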

CLC number: