Journal of Computer Applications ›› 2026, Vol. 46 ›› Issue (6): 1785-1792. DOI: 10.11772/j.issn.1001-9081.2025050662

• Artificial intelligence •

Legal case retrieval method via case information reformulation using large language model

Jintao WANG, Zhilin GAO, Qixiang MENG, Fanliang BU

  1. School of Information and Cyber Security, People’s Public Security University of China, Beijing 100038, China
  • Received: 2025-06-23 Revised: 2025-09-07 Accepted: 2025-09-09 Online: 2025-09-15 Published: 2026-06-10
  • Contact: Fanliang BU
  • About author: WANG Jintao, born in 2000, M. S. candidate. His research interests include natural language processing and information retrieval.
    GAO Zhilin, born in 2000, M. S. candidate. His research interests include information extraction and deep learning.
    MENG Qixiang, born in 1999, M. S. candidate. His research interests include information extraction and deep learning.
    First author contact: BU Fanliang, born in 1965, Ph. D., professor. His research interests include security and prevention engineering and deep learning.
  • Supported by:
    Double First-Class Innovation Research Project for Security and Protection Engineering of People’s Public Security University of China (2023SYL08)

基于大语言模型重构案件信息的类案检索方法

王劲滔, 高志霖, 孟琪翔, 卜凡亮

  1. 中国人民公安大学 信息网络安全学院,北京 100038
  • 通讯作者: 卜凡亮
  • 作者简介:王劲滔(2000—),男,江苏扬州人,硕士研究生,CCF会员,主要研究方向:自然语言处理、信息检索
    高志霖(2000—),男,江苏徐州人,硕士研究生,主要研究方向:信息提取、深度学习
    孟琪翔(1999—),男,江苏徐州人,硕士研究生,主要研究方向:信息提取、深度学习
    第一联系人:卜凡亮(1965—),男,江苏徐州人,教授,博士,主要研究方向:安全防范工程、深度学习。
  • 基金资助:
    中国人民公安大学安全防范工程双一流专项(2023SYL08)

Abstract:

As intelligent judiciary systems develop, legal case retrieval has attracted significant attention because of its role in ensuring judicial fairness and efficiency. However, existing text retrieval methods still face three challenges: traditional models are easily misled by similarities in semantic structure, which makes it difficult to accurately capture the elements that influence judgments; pre-trained language models are constrained by input length, which leads to insufficient global semantic modeling of lengthy legal texts; and existing aggregated similarity scoring mechanisms are prone to noise interference and lack strong interpretability. To address these challenges, a legal case retrieval method based on case information reformulation with a Large Language Model (LLM) was proposed. Firstly, the LLM was employed to extract information from case texts and to combine case elements, descriptions of the legal provisions applicable to the crimes, and case behavior chains into sub-facts of cases, thereby reducing information redundancy. Secondly, for the encoding part, an SFA-SAILER (Selective Feature Attention & Structure-Aware pre-traIned language model for LEgal case Retrieval) encoding architecture was designed. Thirdly, case information was deeply encoded along two dimensions, word and feature, to strengthen the dependency between case information and encoding dimensions. Finally, the MaxSim operator was used to aggregate similarity scores. Experimental results show that on LeCaRD (Legal Case Retrieval Dataset), the proposed model achieves a mean Average Precision (mAP) of 67.45% and a Top-3 Precision (P@3) of 60.95%, and its Top-K Normalized Discounted Cumulative Gain (NDCG@K) values are all higher than those of the comparison models. These results indicate that the proposed model offers a new approach that integrates legal logic with deep semantic understanding for legal case retrieval, and has practical value for intelligent judiciary applications.
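
The final aggregation step can be illustrated with a minimal sketch of a MaxSim-style score. The abstract does not give the exact formulation, so the following is an assumption: each case is represented by a small set of sub-fact embedding vectors, similarity is cosine similarity, and the per-query maxima are summed (as in common late-interaction retrieval setups). The names `cosine` and `maxsim_score` and the toy vectors are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def maxsim_score(query_vecs: Sequence[Vector], cand_vecs: Sequence[Vector]) -> float:
    """For each query sub-fact vector, keep only its best-matching candidate
    sub-fact vector, then sum those maxima (assumed aggregation rule)."""
    if not query_vecs or not cand_vecs:
        return 0.0
    return sum(max(cosine(q, c) for c in cand_vecs) for q in query_vecs)


# Toy data (hypothetical): two sub-facts per case, 2-dimensional embeddings.
query = [[1.0, 0.0], [0.0, 1.0]]
candidate_a = [[1.0, 0.0], [0.0, 2.0]]   # has a match for both query sub-facts
candidate_b = [[1.0, 0.0], [-1.0, 0.0]]  # has a match for the first sub-fact only

ranked = sorted(
    {"a": candidate_a, "b": candidate_b}.items(),
    key=lambda item: maxsim_score(query, item[1]),
    reverse=True,
)
print([name for name, _ in ranked])
```

Because each query sub-fact contributes only its single best match, poorly matching sub-facts in the candidate do not lower the score, which is one way such an operator can limit noise in the aggregated similarity.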

Key words: Large Language Model (LLM), legal case retrieval, intelligent judiciary, text matching
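
For readers unfamiliar with the metrics reported in the abstract, the sketch below shows one common way to compute P@K, average precision (whose mean over queries gives mAP), and NDCG@K for a single ranked list. It is illustrative only: the paper's exact evaluation protocol is not stated in the abstract, the NDCG variant here uses linear gains with a log2 discount, the ideal ranking is built from the labels in the ranked list itself, and all function names and toy labels are hypothetical.

```python
import math
from typing import Sequence


def precision_at_k(relevant_flags: Sequence[int], k: int) -> float:
    """Fraction of the top-k ranked results that are relevant (flags are 0/1)."""
    return sum(relevant_flags[:k]) / k


def average_precision(relevant_flags: Sequence[int]) -> float:
    """Mean of the precision values taken at each rank holding a relevant result.
    Assumes every relevant case appears somewhere in the ranked list."""
    hits = 0
    precision_sum = 0.0
    for rank, flag in enumerate(relevant_flags, start=1):
        if flag:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(runs: Sequence[Sequence[int]]) -> float:
    """mAP: average of the per-query average precision values."""
    return sum(average_precision(r) for r in runs) / len(runs) if runs else 0.0


def ndcg_at_k(gains: Sequence[float], k: int) -> float:
    """NDCG@k with linear gains and a log2 rank discount (one common variant)."""
    def dcg(values: Sequence[float]) -> float:
        return sum(g / math.log2(i + 2) for i, g in enumerate(values[:k]))

    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Toy ranked list for one query (hypothetical labels, best-ranked first).
flags = [1, 0, 1, 1, 0]   # binary relevance
gains = [3, 0, 2, 1, 0]   # graded relevance for the same ranking

print(precision_at_k(flags, 3), average_precision(flags), ndcg_at_k(gains, 3))
```

P@K and mAP treat relevance as binary, whereas NDCG@K rewards placing the most relevant cases at the top of the ranking, which is why the three metrics are usually reported together.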

摘要:

随着智慧司法建设的推进,类案检索技术因为在保障司法公正性与效率性中的关键作用备受关注。然而,现有文本检索方法仍面临以下挑战:传统模型易受语义结构相似性干扰,难以精准捕捉影响判决的要素;预训练语言模型受限于输入长度,对冗长法律文本的全局语义建模不足;现有的聚合相似度评分机制易受噪声干扰,可解释性不强。针对上述问题,提出一种基于大语言模型(LLM)重构案件信息的类案检索方法。首先,利用LLM对案件文本进行信息抽取,以将案件要素、罪行适用法条描述与案件行为链组合成案件子事实,从而减少信息冗余;其次,在编码部分,设计SFA-SAILER (Selective Feature Attention & Structure-Aware pre-traIned language model for LEgal case Retrieval)编码架构;再次,通过在词与特征两个不同维度对案件信息进行深度编码,增强案件信息与编码维度间的依赖关系;最后,使用MaxSim操作符聚合相似度分数。实验结果表明,所提模型在LeCaRD (Legal Case Retrieval Dataset)上的平均精确率均值(mAP)与前3个结果的精确率(P@3)指标分别达到了67.45%和60.95%,而前K个结果的归一化折损累计增益(NDCG@K)指标也均高于对比模型。可见,所提模型可为类案检索提供兼顾法律逻辑与深度语义理解的新思路,在司法智能化应用中具有实践价值。

关键词: 大语言模型, 类案检索, 智慧司法, 文本匹配

CLC Number: