Semantic-driven learning and classification method of judicial documents

Journal of Computer Applications, 2019, Vol. 39, Issue 6: 1696-1700. DOI: 10.11772/j.issn.1001-9081.2018109193

MA Jiangang (马建刚)^1,2,3, MA Yinglong (马应龙)^4

1. Law School, Renmin University of China, Beijing 100872, China;
2. National Prosecutors College of P. R. C., Beijing 102206, China;
3. The People's Procuratorate of Henan Province, Zhengzhou, Henan 450004, China;
4. School of Control and Computer Engineering, North China Electric Power University, Beijing 102206, China

Received: 2018-11-15 Revised: 2019-01-03 Online: 2019-06-17 Published: 2019-06-10
Corresponding author: MA Jiangang
About the authors: MA Jiangang, born in 1977 in Zhengzhou, Henan, Ph.D., senior engineer, CCF senior member. His research interests include big data, smart procuratorate and smart justice. MA Yinglong, born in 1976 in Xianyang, Shaanxi, Ph.D., professor, CCF senior member. His research interests include big data and knowledge engineering.
Supported by:
National Key R&D Program of China (2018YFC0831404, 2018YFC0830605); China Postdoctoral Science Foundation (2016M591317).

Abstract

Abstract: Efficient classification of large-scale judicial documents supports current intelligent judicial applications such as similar-case recommendation, legal document retrieval, judgment prediction and sentencing assistance. General-domain text classification methods perform poorly on judicial texts because they do not consider the complex structure and knowledge semantics of such documents. To solve this problem, a semantic-driven method was proposed to learn and classify judicial documents. Firstly, a domain knowledge model for the judicial domain was proposed and constructed to express document-level semantics clearly. Then, domain knowledge was extracted from the judicial documents based on this model. Finally, a Graph Long Short-Term Memory (Graph LSTM) model was trained on the judicial documents and used to classify them. The experimental results show that the proposed method clearly outperforms the commonly used Long Short-Term Memory (LSTM) model, Multinomial Logistic Regression (MLR) and Support Vector Machine (SVM) in accuracy and recall.

Key words: judicial big data, domain knowledge model, text categorization, smart procuratorate, Graph Long Short-Term Memory (Graph LSTM) model
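The abstract compares classifiers by accuracy and recall. As a minimal sketch of how these two measures can be computed for a multi-class document classifier, the Python below uses only the standard library. It rests on assumptions that are not taken from the paper: recall is macro-averaged over classes, and the function names and the case-type labels ("theft", "fraud", "robbery") are hypothetical and chosen only for illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted class equals the true class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes present in y_true.

    Macro averaging is an assumption of this sketch; the abstract does not
    say how recall was averaged across classes.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    support = Counter(y_true)  # number of true documents per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per class
    return sum(hits[c] / support[c] for c in support) / len(support)


# Hypothetical labels for six documents (not data from the paper).
y_true = ["theft", "theft", "fraud", "fraud", "robbery", "robbery"]
y_pred = ["theft", "fraud", "fraud", "fraud", "robbery", "theft"]

print("accuracy:", accuracy(y_true, y_pred))
print("macro recall:", macro_recall(y_true, y_pred))
```

Reporting both measures matters when classes are imbalanced: a classifier that always predicts the majority class can score high accuracy while its macro-averaged recall stays low.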

CLC number:

TP309

Cite this article: MA Jiangang, MA Yinglong. Semantic-driven learning and classification method of judicial documents[J]. Journal of Computer Applications, 2019, 39(6): 1696-1700.

