Journal of Computer Applications (sole official website)


Exploring large language model detector preference and robustness for fake news detection

He Canyuan1, Li Dayang1, Liu Renyang2, Zeng Yaxin1, Song Bingbing1, Yao Shaowen3, Liang Yu3, Zhou Wei4

  1. Yunnan University
  2. National University of Singapore
  3. School of Software, Yunnan University
  4. School of Software, Yunnan University, Kunming 650091
  • Received: 2026-02-09 Revised: 2026-04-20 Online: 2026-05-13 Published: 2026-05-13
  • Corresponding author: Zhou Wei

Abstract: Large language models (LLMs), equipped with strong parametric knowledge and reasoning capabilities, have shown great potential in fake news detection, yet the robustness of their internal decision-making remains largely unexplored. It is observed that LLMs, when detecting fake news, follow stable reasoning trajectories similar to those of human fact-checkers, a phenomenon defined here as cognitive path preference. First, a Path Preference Evaluation Framework (PPEF) based on the "LLM as a Judge" paradigm is proposed to formally define and quantify this phenomenon. PPEF extracts the explanatory rationales produced by each detector, maps them onto a preference feature ontology, and then identifies the path preferences shared by multiple LLMs. To examine whether malicious propagators could exploit these preferences and turn them into detection vulnerabilities, a Multi-View Rewrite (MVR) method is designed that selectively weakens the cues matching LLM preferences while preserving the semantics of the original news. Experiments on multiple public datasets with different detection models show that LLMs rely significantly and persistently on a small set of core cues during fake news detection. Compared with five non-preference-oriented rewriting methods, including Style-Based Rewriting (SBG) and Open-ended Generation (OEG), MVR, which specifically weakens preference features, causes more severe degradation of detector performance. On the Gossipcop dataset, adversarial samples generated by MVR reduce the detection rate of a basic chain-of-thought (CoT) driven LLM (Vanilla LLM) by 1.8 to 46.7 percentage points, an interference effect markedly stronger than that of adversarial samples produced by baselines such as SBG and OEG. These findings reveal a practically exploitable LLM vulnerability and offer new ideas for building preference-aware, path-diversified detection frameworks.
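The preference-identification step described in the abstract (rationales mapped to ontology categories, then compared across detectors) can be illustrated with a minimal sketch. This is not the authors' PPEF implementation: the category names, the detector names, the `top_k` cutoff, and both helper functions are hypothetical, and the labels are assumed to have already been assigned by some judge model.

```python
from collections import Counter


def preference_profile(labels):
    """Share of each ontology category among one detector's rationale labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def shared_preferences(profiles, top_k=2):
    """Categories that rank within the top_k for every detector."""
    top_sets = []
    for profile in profiles.values():
        # Sort by descending share; break ties alphabetically for determinism.
        ranked = sorted(profile, key=lambda c: (-profile[c], c))
        top_sets.append(set(ranked[:top_k]))
    return set.intersection(*top_sets) if top_sets else set()


# Hypothetical judge-assigned labels, one per extracted rationale.
rationale_labels = {
    "detector_a": ["source_credibility", "emotional_tone",
                   "source_credibility", "factual_consistency"],
    "detector_b": ["emotional_tone", "source_credibility",
                   "emotional_tone", "writing_style"],
}

profiles = {name: preference_profile(labels)
            for name, labels in rationale_labels.items()}
shared = shared_preferences(profiles)
print(sorted(shared))
```

In this toy setup, a category counts as a shared preference only if every detector ranks it among its most frequent rationale types; a rewriting method in the spirit of MVR would then target the cues behind those categories.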

Key words: Multi-View Rewrite (MVR)

CLC number: