Semi-supervised fake job advertisement detection model based on consistency training

Journal of Computer Applications, 2023, Vol. 43, Issue 9: 2932-2939. DOI: 10.11772/j.issn.1001-9081.2022081163

Section: Multimedia computing and computer simulation

Ruiqi WANG, Shujuan JI, Ning CAO, Yajie GUO

Shandong Provincial Key Laboratory of Wisdom Mine Information Technology (Shandong University of Science and Technology), Qingdao, Shandong 266590, China

Received: 2022-08-08  Revised: 2023-01-07  Accepted: 2023-01-16  Online: 2023-09-10  Published: 2023-09-10
Corresponding author: Shujuan JI
About the authors: WANG Ruiqi, born in 1997 in Heze, Shandong, female, M.S. candidate. Her research interests include artificial intelligence.
CAO Ning, born in 1997 in Heze, Shandong, male, Ph.D. candidate. His research interests include artificial intelligence.
GUO Yajie, born in 1996 in Dongying, Shandong, male, M.S. His research interests include artificial intelligence.
Supported by: National Natural Science Foundation of China (71772107)

Abstract:

The proliferation of fake job advertisements damages the legitimate rights and interests of job seekers, disrupts the normal employment order, and degrades the user experience of job seekers. To detect fake job advertisements effectively, SSC, a Semi-Supervised fake job advertisement detection model based on Consistency training, is proposed. First, a consistency regularization term is applied to all the data to improve model performance. Then, the supervised loss and the unsupervised loss are combined through joint training into a semi-supervised loss. Finally, the semi-supervised loss is used to optimize the model. Experimental results on two real datasets, EMSCAD (EMployment SCam Aegean Dataset) and IMDB (Internet Movie DataBase), show that SSC achieves the best detection performance when only 20 labeled samples are available: on the two datasets respectively, its accuracy is 2.2 and 2.8 percentage points higher than that of the advanced semi-supervised learning model UDA (Unsupervised Data Augmentation), and 3.4 and 11.7 percentage points higher than that of the deep learning model BERT (Bidirectional Encoder Representations from Transformers). SSC also shows good scalability.

Key words: false information detection, semi-supervised learning, online recruitment, fake job advertisement, consistency training
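The abstract describes a semi-supervised loss that joins a supervised term with consistency terms, but this page does not give the exact formula. The sketch below is therefore only one plausible reading, not the authors' implementation: it assumes a cross-entropy term on a labeled sample, a KL consistency term between an unlabeled sample and its augmented version, and a bidirectional KL term between two stochastic forward passes weighted by `lam`. All function and parameter names are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def bidirectional_kl(p, q):
    """Symmetric average of KL(p || q) and KL(q || p)."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def cross_entropy(probs, label, eps=1e-12):
    """Negative log-probability of the true class."""
    return -math.log(probs[label] + eps)


def semi_supervised_loss(labeled_logits, label, unlabeled_logits,
                         augmented_logits, second_pass_logits, lam=4.0):
    """Assumed composition: supervised + consistency + lam * bidirectional KL."""
    supervised = cross_entropy(softmax(labeled_logits), label)
    p_orig = softmax(unlabeled_logits)
    p_aug = softmax(augmented_logits)
    p_second = softmax(second_pass_logits)
    consistency = kl_divergence(p_orig, p_aug)          # original vs. augmented
    dropout_term = bidirectional_kl(p_orig, p_second)   # two stochastic passes
    return supervised + consistency + lam * dropout_term


if __name__ == "__main__":
    # Toy logits for a two-class problem (values are made up).
    print(semi_supervised_loss([2.0, 0.0], 0, [1.0, 0.0], [0.8, 0.1], [0.9, 0.2]))
```

When all three unlabeled predictions agree, both consistency terms vanish and only the supervised term remains, which is the behaviour a consistency regularizer is meant to encourage.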

CLC number: TP391.1

Ruiqi WANG, Shujuan JI, Ning CAO, Yajie GUO. Semi-supervised fake job advertisement detection model based on consistency training[J]. Journal of Computer Applications, 2023, 43(9): 2932-2939.

Figures/Tables: 13

Dataset	Domain	Real/positive samples	Fraudulent/negative samples
EMSCAD	Job advertisement text	17 014	866
IMDB	Movie review text	25 000	25 000

Name	Description	Attribute type
title	Advertisement title	Text feature
location	Geographic location of the advertisement	Text feature
department	Department within the company	Text feature
salary_range	Salary range	Numeric feature
company_profile	Company profile	Text feature
description	Job advertisement description	Text feature
requirements	Job requirements	Text feature
benefits	Job benefits	Text feature
telecommuting	Whether the job is remote	Numeric feature
has_company_logo	Whether a company logo is present	Numeric feature
employment_type	Employment type	Text feature
required_experience	Required work experience	Text feature
required_education	Required education level	Text feature
industry	Industry of the company	Text feature
function	Job function	Text feature
fraudulent	Whether the advertisement is fraudulent	Numeric feature

Name	Description	Attribute type
review	Movie review	Text feature
sentiment	Sentiment polarity	Numeric feature

Dataset	Labeled samples	Unlabeled samples	Share of labeled samples/%
EMSCAD	20	1 732	1.10
EMSCAD	100	1 732	5.40
EMSCAD	200	1 732	10.30
EMSCAD	300	1 732	14.70
EMSCAD	400	1 732	18.70
EMSCAD	20	17 880	0.10
IMDB	20	20 000	0.10
IMDB	100	20 000	0.50
IMDB	200	20 000	1.00
IMDB	300	20 000	1.50
IMDB	400	20 000	2.00
IMDB	20	50 000	0.03

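Parameter	Value	Meaning
Learning rate	1E-5 (EMSCAD), 2E-5 (IMDB)	Learning rate
Maximum sequence length	128	Maximum sequence length
Dropout	0.3	Probability that a neural unit is dropped
TSA	exp_schedule	Mitigates overfitting caused by scarce labeled data
λ	4	Controls the weight of the bidirectional KL divergence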
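The TSA row refers to an `exp_schedule`, which matches the naming of Training Signal Annealing in UDA. As a minimal sketch, assuming SSC reuses UDA's exponential schedule unchanged (this page does not confirm that), the threshold grows from roughly 1/K toward 1 over training, and labeled examples whose correct-class probability already exceeds the threshold are left out of the supervised loss. The function names and the `scale=5.0` default are illustrative.

```python
import math


def tsa_threshold(step, total_steps, num_classes, scale=5.0):
    """Exponential TSA schedule: alpha = exp((t/T - 1) * scale),
    threshold = alpha * (1 - 1/K) + 1/K."""
    alpha = math.exp((step / total_steps - 1.0) * scale)
    return alpha * (1.0 - 1.0 / num_classes) + 1.0 / num_classes


def tsa_masked_loss(correct_class_probs, threshold):
    """Average cross-entropy over labeled examples that are not yet
    predicted with confidence above the threshold."""
    kept = [-math.log(p) for p in correct_class_probs if p <= threshold]
    return sum(kept) / len(kept) if kept else 0.0


if __name__ == "__main__":
    # Binary task (K = 2): the threshold starts near 0.5 and ends at 1.0.
    for step in (0, 50, 100):
        print(step, round(tsa_threshold(step, 100, 2), 4))
    # Only the 0.5-confidence example contributes at threshold 0.6.
    print(tsa_masked_loss([0.9, 0.5], 0.6))
```

With very few labels, masking confident examples early keeps the model from fitting the labeled set long before the unlabeled consistency signal has had an effect.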

Labels	Type	Model	EMSCAD Acc	EMSCAD P	EMSCAD R	EMSCAD F1	IMDB Acc	IMDB P	IMDB R	IMDB F1
20	Supervised	Random forest	0.636	0.663	0.636	0.621	0.557	0.573	0.557	0.550
20	Supervised	SVM	0.646	0.647	0.646	0.645	0.524	0.524	0.524	0.523
20	Supervised	BERT	0.666	0.677	0.666	0.661	0.600	0.605	0.600	0.594
20	Semi-supervised	UDA	0.678	0.699	0.678	0.669	0.689	0.698	0.689	0.685
20	Semi-supervised	SSC	0.700	0.717	0.700	0.695	0.717	0.719	0.717	0.716
100	Supervised	Random forest	0.771	0.772	0.771	0.771	0.644	0.645	0.644	0.644
100	Supervised	SVM	0.719	0.722	0.719	0.719	0.602	0.607	0.602	0.600
100	Supervised	BERT	0.800	0.801	0.800	0.800	0.741	0.751	0.741	0.738
100	Semi-supervised	UDA	0.825	0.838	0.825	0.823	0.779	0.789	0.779	0.777
100	Semi-supervised	SSC	0.839	0.848	0.839	0.838	0.792	0.792	0.792	0.791
200	Supervised	Random forest	0.829	0.829	0.829	0.829	0.722	0.724	0.722	0.721
200	Supervised	SVM	0.781	0.781	0.781	0.781	0.675	0.677	0.676	0.675
200	Supervised	BERT	0.865	0.864	0.865	0.865	0.796	0.800	0.796	0.796
200	Semi-supervised	UDA	0.871	0.873	0.871	0.871	0.817	0.817	0.817	0.817
200	Semi-supervised	SSC	0.888	0.888	0.888	0.888	0.824	0.826	0.824	0.823
300	Supervised	Random forest	0.879	0.879	0.879	0.879	0.743	0.745	0.743	0.742
300	Supervised	SVM	0.847	0.847	0.847	0.847	0.713	0.716	0.713	0.712
300	Supervised	BERT	0.895	0.896	0.895	0.895	0.822	0.823	0.822	0.822
300	Semi-supervised	UDA	0.908	0.909	0.908	0.908	0.832	0.833	0.832	0.832
300	Semi-supervised	SSC	0.915	0.915	0.915	0.915	0.840	0.841	0.840	0.840
400	Supervised	Random forest	0.903	0.903	0.903	0.903	0.760	0.761	0.760	0.759
400	Supervised	SVM	0.885	0.887	0.885	0.884	0.734	0.737	0.734	0.734
400	Supervised	BERT	0.918	0.918	0.908	0.908	0.843	0.845	0.844	0.844
400	Semi-supervised	UDA	0.923	0.923	0.923	0.923	0.848	0.849	0.848	0.848
400	Semi-supervised	SSC	0.926	0.926	0.926	0.926	0.851	0.852	0.851	0.851
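The accuracy gains quoted in the abstract can be checked against the 20-label rows of the table above. The short script below copies those six accuracy values from the table and computes the differences in percentage points; the dictionary and function names are only for illustration.

```python
# Accuracy with 20 labeled samples, copied from the results table.
accuracy_20_labels = {
    "EMSCAD": {"BERT": 0.666, "UDA": 0.678, "SSC": 0.700},
    "IMDB": {"BERT": 0.600, "UDA": 0.689, "SSC": 0.717},
}


def gain_in_points(dataset, baseline):
    """Accuracy gain of SSC over a baseline, in percentage points."""
    scores = accuracy_20_labels[dataset]
    return round((scores["SSC"] - scores[baseline]) * 100, 1)


if __name__ == "__main__":
    for dataset in accuracy_20_labels:
        for baseline in ("UDA", "BERT"):
            print(dataset, baseline, gain_in_points(dataset, baseline))
```

The computed gains are 2.2 and 2.8 points over UDA and 3.4 and 11.7 points over BERT on EMSCAD and IMDB respectively, matching the figures in the abstract.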

Model	EMSCAD Acc	EMSCAD P	EMSCAD R	EMSCAD F1	IMDB Acc	IMDB P	IMDB R	IMDB F1
UDA	0.715	0.716	0.715	0.714	0.842	0.844	0.842	0.842
SSC	0.735	0.738	0.735	0.734	0.869	0.870	0.869	0.869

Type	Model	Runtime on EMSCAD/s	Runtime on IMDB/s
Supervised	Random forest	1.2×10^-1	1.9×10^-1
Supervised	SVM	3.7×10^-2	1.0×10^-1
Supervised	BERT	1.4×10^3	2.5×10^3
Semi-supervised	UDA	3.5×10^3	4.6×10^3
Semi-supervised	SSC	7.7×10^3	8.9×10^3