Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (10): 3281-3287. DOI: 10.11772/j.issn.1001-9081.2023101558

• The 40th CCF National Database Conference (NDBC 2023) •

Semi-supervised stance detection based on category-aware curriculum learning

Zhaoze GAO, Xiaofei ZHU, Nengqiang XIANG

  1. College of Computer Science and Engineering, Chongqing University of Technology, Chongqing 400054, China
  • Received: 2023-11-13 Revised: 2023-12-28 Accepted: 2024-01-02 Online: 2024-10-15 Published: 2024-10-10
  • Contact: Xiaofei ZHU
  • About author: GAO Zhaoze, born in 1996, M. S. candidate. His research interests include natural language processing and stance detection.
    XIANG Nengqiang, born in 1998, M. S. candidate. His research interests include natural language processing and social networks.
  • Supported by:
    Chongqing Natural Science Foundation (CSTB2022NSCQ-MSX1672); Science and Technology Research Plan Major Project of Chongqing Municipal Education Commission (KJZD-M202201102); Chongqing University of Technology School-Level Joint Funding Project (gzlcx20233248)

基于类别感知课程学习的半监督立场检测

高肇泽, 朱小飞, 项能强

  1. 重庆理工大学 计算机科学与工程学院,重庆 400054
  • 通讯作者: 朱小飞
  • 作者简介:高肇泽(1996—),男,山东枣庄人,硕士研究生,CCF会员,主要研究方向:自然语言处理、立场检测
    朱小飞(1979—),男,重庆人,教授,博士,CCF会员,主要研究方向:自然语言处理、信息检索 zxf@cqut.edu.cn
    项能强(1998—),男,四川达州人,硕士研究生,CCF会员,主要研究方向:自然语言处理、社交网络。
  • 基金资助:
    重庆市自然科学基金资助项目(CSTB2022NSCQ-MSX1672);重庆市教育委员会科学技术研究计划重大项目(KJZD-M202201102);重庆理工大学校级联合资助项目(gzlcx20233248)

Abstract:

Pseudo-label generation is an effective strategy in semi-supervised stance detection. In practice, the generated pseudo-labels vary in quality, yet existing work treats them as equally reliable and does not fully consider how category imbalance affects pseudo-label quality. To address these two issues, a Semi-supervised stance Detection model based on Category-aware curriculum Learning (SDCL) is proposed. First, a pre-trained classification model generates pseudo-labels for unlabeled tweets. Then, the tweets are sorted within each category by pseudo-label quality, and the top k high-quality tweets of each category are selected. Finally, the tweets selected from all categories are merged, re-sorted, and fed back into the classification model together with their pseudo-labels to further optimize the model parameters. Experimental results show that, compared with SANDS (Stance Analysis via Network Distant Supervision), the best-performing baseline, the proposed model improves the Macro-averaged F1 (Mac-F1) score on the StanceUS dataset by 2, 1, and 3 percentage points under three splits (500, 1 000, and 1 500 labeled tweets), respectively. On the StanceIN dataset, it improves the Mac-F1 score by 1 percentage point under each of the three splits, which validates the effectiveness of the proposed model.
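
The selection procedure summarized in the abstract (pseudo-label, rank within each category, keep the top k per category, merge and re-sort) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function name `select_curriculum`, the use of the classifier's maximum class probability as the "quality" score, and the descending (most confident first) final ordering are all assumptions made for illustration, since the abstract does not specify how quality is measured or in which direction the merged tweets are sorted.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# (tweet index, pseudo-label, confidence); a hypothetical record layout
Selected = Tuple[int, int, float]


def select_curriculum(probs: Sequence[Sequence[float]], k: int) -> List[Selected]:
    """Pick the top-k most confident pseudo-labeled tweets per category.

    probs[i] is the predicted class distribution for unlabeled tweet i.
    Assumption: pseudo-label quality is the maximum class probability.
    """
    per_class: Dict[int, List[Selected]] = defaultdict(list)
    for idx, p in enumerate(probs):
        # Pseudo-label = most probable class for this tweet
        label = max(range(len(p)), key=p.__getitem__)
        per_class[label].append((idx, label, p[label]))

    merged: List[Selected] = []
    for items in per_class.values():
        # Rank within the category, then keep only its k best tweets
        items.sort(key=lambda t: t[2], reverse=True)
        merged.extend(items[:k])

    # Re-sort the merged pool (assumed order: most confident first)
    merged.sort(key=lambda t: t[2], reverse=True)
    return merged


if __name__ == "__main__":
    probs = [
        [0.90, 0.10],
        [0.80, 0.20],
        [0.70, 0.30],
        [0.40, 0.60],
        [0.45, 0.55],
    ]
    for row in select_curriculum(probs, k=2):
        print(row)
```

In this toy input, a single global top-4 ranking would keep three tweets of class 0 and only one of class 1, whereas the per-category cut keeps two of each. That is the sense in which the selection is category-aware under class imbalance.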

Key words: semi-supervised, stance detection, category imbalance, curriculum learning, pseudo-label generation

摘要:

生成伪标签是半监督立场检测的一种有效策略。在现实应用中,生成的伪标签质量存在差异,然而现有的工作将生成伪标签的质量视为是同等的,且没有充分考虑类别不平衡对伪标签生成质量的影响。为了解决上述2个问题,提出基于类别感知课程学习的半监督立场检测模型(SDCL)。首先,使用预训练分类模型对无标签推文生成伪标签;其次,根据伪标签质量的高低对推文按类别排序,并选取每个类别前k个高质量推文;最后,将各个类别选出的推文合并后重新排序,并把排序后带有伪标签的推文再输入分类模型,从而进一步优化模型参数。实验结果表明,与基线模型中表现最好的SANDS (Stance Analysis via Network Distant Supervision)相比,所提模型在3种不同划分(有标签推文总数为500、1 000和1 500)情况下,在StanceUS数据集上的宏平均(Mac-F1)分数分别提高了2、1和3个百分点,在StanceIN数据集上的Mac-F1分数均提高了1个百分点,验证了所提模型的有效性。

关键词: 半监督, 立场检测, 类别不平衡, 课程学习, 伪标签生成

CLC Number: