Journal of Computer Applications ›› 2019, Vol. 39 ›› Issue (5): 1293-1298. DOI: 10.11772/j.issn.1001-9081.2018102085

• Artificial intelligence •

Efficient judicial document classification based on knowledge block summarization and word mover’s distance

MA Jiangang1,2,3, ZHANG Peng4, MA Yinglong4   

  1. School of Law, Renmin University of China, Beijing 100872, China;
    2. National Prosecutor College, Beijing 102206, China;
    3. People's Procuratorate of Henan Province, Zhengzhou Henan 450004, China;
    4. School of Control and Computer Engineering, North China Electric Power University, Beijing 102206, China
  • Received: 2018-10-15 Revised: 2018-11-16 Online: 2019-05-14 Published: 2019-05-10
  • Supported by:
    This work is partially supported by the National Key R&D Program of China (2018YFC0830605, 2018YFC0831404) and the China Postdoctoral Science Foundation (2016M591317).
  • Corresponding author: MA Yinglong (马应龙)
  • About the authors: MA Jiangang (马建刚, born 1977), male, a native of Zhengzhou, Henan; senior engineer, Ph.D., senior member of CCF. His research interests include big data, smart procuratorate, and smart justice. ZHANG Peng (张鹏, born 1995), male, a native of Zhoukou, Henan; M.S. candidate, member of CCF. His research interests include natural language processing and knowledge graphs. MA Yinglong (马应龙, born 1976), male, a native of Xianyang, Shaanxi; professor, doctoral supervisor, Ph.D., senior member of CCF. His research interests include artificial intelligence, knowledge engineering, big data analysis, service computing, and service collaboration.

Abstract: As the intelligent transformation of judicial organs nationwide deepens, the massive body of judicial documents accumulated through years of information technology application provides a data basis for intelligent judicial services. Similarity analysis of judicial documents enables the recommendation of similar cases, which gives judicial officials intelligent decision support in case handling and thereby improves the quality and efficiency of their work. Text classification approaches designed for general domains perform poorly on judicial documents because they ignore the complex structure and knowledge semantics of these documents. To address this problem, an efficient judicial document classification approach based on knowledge block summarization and Word Mover's Distance (WMD) was proposed. Firstly, a domain ontology knowledge model was built for judicial documents. Secondly, based on the domain ontology, summaries of the core knowledge blocks in judicial documents were obtained by information extraction. Thirdly, the WMD algorithm was used to compute the similarity between judicial documents from their knowledge block summaries. Finally, the K-Nearest Neighbors (KNN) algorithm was used to classify the judicial documents. Experiments on case document sets for two typical crimes show that, compared with the traditional WMD document similarity computation method, the proposed approach improves classification accuracy by 5.5 and 9.9 percentage points respectively, and also reduces classification time, running 52.4 and 89.1 times as fast respectively.

Key words: smart procuratorate, domain ontology model, document classification, similarity computation, knowledge block summarization, Word Mover's Distance (WMD)
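
The pipeline outlined in the abstract (summarize, compare summaries, classify by nearest neighbors) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions made for illustration: the toy two-dimensional word vectors, the `ONTOLOGY_TERMS` set, the crime labels `"theft"` and `"injury"`, and all function names are hypothetical. The summarization step is reduced to a crude keyword filter standing in for ontology-guided information extraction, and exact WMD (which requires solving a transport problem) is replaced by the relaxed WMD lower bound, in which each word sends all of its weight to its nearest word in the other document.

```python
import math
from collections import Counter

# Hypothetical toy word vectors; a real system would load trained embeddings.
TOY_VECTORS = {
    "theft": (1.0, 0.0), "steal": (0.9, 0.1), "wallet": (0.8, 0.3),
    "injury": (0.0, 1.0), "assault": (0.1, 0.9), "knife": (0.3, 0.8),
}
# Hypothetical stand-in for the concepts of a judicial domain ontology.
ONTOLOGY_TERMS = set(TOY_VECTORS)


def knowledge_block_summary(tokens):
    """Crude stand-in for summarization: keep only ontology terms."""
    return [t for t in tokens if t in ONTOLOGY_TERMS]


def nbow(tokens):
    """Normalized bag-of-words weights over in-vocabulary tokens."""
    counts = Counter(t for t in tokens if t in TOY_VECTORS)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def one_sided_cost(src, dst):
    """Each source word moves all its weight to its nearest target word."""
    return sum(
        weight * min(math.dist(TOY_VECTORS[w], TOY_VECTORS[v]) for v in dst)
        for w, weight in src.items()
    )


def relaxed_wmd(tokens_a, tokens_b):
    """Relaxed WMD: a lower bound on exact WMD, not WMD itself."""
    a, b = nbow(tokens_a), nbow(tokens_b)
    if not a or not b:
        return math.inf  # no comparable words in one of the documents
    return max(one_sided_cost(a, b), one_sided_cost(b, a))


def knn_classify(query, labeled_summaries, k=3):
    """Majority vote among the k summaries closest to the query."""
    nearest = sorted(
        labeled_summaries, key=lambda item: relaxed_wmd(query, item[0])
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labeled summaries and one raw query document.
train = [
    (["steal", "wallet"], "theft"),
    (["theft", "wallet", "steal"], "theft"),
    (["assault", "knife"], "injury"),
    (["injury", "assault"], "injury"),
]
raw_query = "the defendant did steal a wallet".split()
print(knn_classify(knowledge_block_summary(raw_query), train, k=3))
```

Because distances are computed on short summaries instead of full documents, each comparison touches far fewer word pairs, which is one plausible reading of where the reported speedup comes from; the sketch does not reproduce the reported figures.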

CLC Number: