Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (1): 71-77. DOI: 10.11772/j.issn.1001-9081.2021010122

• Artificial Intelligence •

Customs declaration goods classification algorithm based on hierarchical multi-task BERT

Qiming RUAN1, Yi GUO1,2,3, Nan ZHENG1, Yexiang WANG1

  1. School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
    2. National Engineering Laboratory for Big Data Distribution and Exchange Technologies - Business Intelligence and Visualization Research Center, Shanghai 200436, China
    3. Shanghai Engineering Research Center of Big Data & Internet Audience, Shanghai 200072, China
  • Received: 2021-01-22  Revised: 2021-05-17  Accepted: 2021-06-04  Online: 2021-07-14  Published: 2022-01-10
  • Contact: Yi GUO
  • About the authors: RUAN Qiming, born in 1997 in Shaoxing, Zhejiang, M. S. candidate, CCF member. His research interests include natural language processing, text classification and recommender systems.
    GUO Yi, born in 1975 in Wuxi, Jiangsu, Ph. D., professor, CCF senior member. His research interests include text mining, knowledge discovery and business intelligence analysis.
    ZHENG Nan, born in 1997 in Weihai, Shandong, M. S. candidate. Her research interests include data mining and recommender systems.
    WANG Yexiang, born in 1996 in Wenzhou, Zhejiang, M. S. candidate. His research interests include natural language processing and dialogue systems.
  • Supported by:
Scientific Research Program of Science & Technology Commission of Shanghai Municipality (17DZ1101003)

Abstract:

In customs goods declaration scenarios, a classification model is required to assign each declared good a uniform Harmonized System (HS) code. However, existing customs goods classification models ignore the positional information of the words in the text to be classified, and the HS codes number in the tens of thousands, which leads to problems such as sparse class vectors and slow model convergence. To address these problems, a classification model based on Hierarchical Multi-task Bidirectional Encoder Representations from Transformers (HM-BERT) was proposed, which follows the manual level-by-level classification strategy used in real business scenarios and makes full use of the hierarchical structure of HS codes. On the one hand, the dynamic word vectors of the Bidirectional Encoder Representations from Transformers (BERT) model were used to capture the positional information in the customs declaration texts; on the other hand, the category information of the different HS code levels was used to train the BERT model in a multi-task manner, thereby improving both the accuracy and the convergence of classification. The effectiveness of the proposed model was verified on the 2019 declaration dataset of a Chinese customs declaration service provider: compared with the BERT model, the HM-BERT model improves accuracy by 2 percentage points and trains faster; compared with Hierarchical fastText (H-fastText), which is likewise hierarchical, it improves accuracy by 7.1 percentage points. Experimental results show that the HM-BERT model can effectively improve the classification of customs declaration goods.
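
For readers who want a concrete picture of the architecture described in the abstract, the sketch below shows one way to realize a hierarchical multi-task BERT classifier in Python. It is a minimal illustration, not the authors' implementation: the use of PyTorch and the Hugging Face transformers library, the bert-base-chinese checkpoint, the per-level label counts, and the equal weighting of the per-level losses are all assumptions made for the example.

    # Minimal HM-BERT-style sketch: one shared BERT encoder with one
    # classification head per HS-code level. Assumptions (not from the
    # paper): PyTorch + Hugging Face transformers, the bert-base-chinese
    # checkpoint, illustrative label counts, equal loss weights.
    import torch.nn as nn
    from transformers import BertModel

    class HMBert(nn.Module):
        def __init__(self, num_labels_per_level=(97, 1200, 5000)):
            super().__init__()
            self.bert = BertModel.from_pretrained("bert-base-chinese")
            hidden = self.bert.config.hidden_size
            # One linear head per level of the HS hierarchy (coarse to fine);
            # the sizes above are illustrative, not taken from the paper.
            self.heads = nn.ModuleList(
                [nn.Linear(hidden, n) for n in num_labels_per_level])

        def forward(self, input_ids, attention_mask):
            # The pooled [CLS] representation is shared by every level's
            # head, which is what makes the training multi-task.
            pooled = self.bert(input_ids=input_ids,
                               attention_mask=attention_mask).pooler_output
            return [head(pooled) for head in self.heads]

    def multi_task_loss(logits_per_level, labels_per_level):
        # Sum the per-level cross-entropy losses; the paper's exact
        # weighting scheme may differ from this equal weighting.
        ce = nn.CrossEntropyLoss()
        return sum(ce(lg, lb) for lg, lb in zip(logits_per_level, labels_per_level))

At inference time, one can either read off the finest-level prediction directly, or use the coarser heads to narrow the candidate set level by level, mirroring the manual level-by-level classification strategy mentioned above.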

Key words: Harmonized System (HS) code, Multi-Task Learning (MTL), text classification, Bidirectional Encoder Representations from Transformers (BERT), vector sparsity

CLC number: