Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (2): 407-412.DOI: 10.11772/j.issn.1001-9081.2020050730

Special Issue: Data Science and Technology

• Data science and technology •

Patent text classification based on ALBERT and bidirectional gated recurrent unit

WEN Chaodong1, ZENG Cheng1,2,3, REN Junwei1, ZHANG Yan1,2,3   

  1. School of Computer Science and Information Engineering, Hubei University, Wuhan Hubei 430062, China;
    2. Hubei Province Engineering Technology Research Center for Software Engineering, Wuhan Hubei 430062, China;
    3. Hubei Engineering Research Center for Smart Government and Artificial Intelligence, Wuhan Hubei 430062, China
  • Received:2020-06-01 Revised:2020-07-22 Online:2021-02-10 Published:2020-08-14
  • Supported by:
    This work is partially supported by the Surface Program of National Natural Science Foundation of China (61977021), the Youth Program of National Natural Science Foundation of China (61902114), the 2019 Hubei Special Project of Technology Innovation (2019ACA144).

  • Corresponding author: ZENG Cheng
  • About the authors: WEN Chaodong (1996-), male, born in Jingzhou, Hubei, M. S. candidate, CCF student member; research interests: natural language processing, text classification. ZENG Cheng (1976-), male, born in Wuhan, Hubei, Ph. D., professor, CCF member; research interests: artificial intelligence, industry software. REN Junwei (1992-), male, born in Yichang, Hubei, M. S. candidate; research interests: natural language processing, recommender systems. ZHANG Yan (1973-), male, born in Yichang, Hubei, Ph. D.; research interests: artificial intelligence, information security.

Abstract: With the rapid growth in the number of patent applications, the demand for automatic classification of patent texts is increasing. Most existing patent text classification algorithms use methods such as Word2vec and Global Vectors (GloVe) to obtain word vector representations of the text, discarding much of the words' positional information and failing to express the complete semantics of the text. To solve these problems, a multi-level patent text classification model named ALBERT-BiGRU was proposed by combining ALBERT (A Lite BERT) with a Bidirectional Gated Recurrent Unit (BiGRU). In this model, dynamic word vectors pre-trained by ALBERT were used in place of the static word vectors trained by traditional methods such as Word2vec, improving the representational ability of the word vectors. The BiGRU neural network was then used for training, preserving to the greatest extent the semantic associations between long-distance words in the patent text. In validation experiments on the patent text dataset published by the State Information Center, compared with Word2vec-BiGRU and GloVe-BiGRU, the accuracy of ALBERT-BiGRU was increased by 9.1 percentage points and 10.9 percentage points respectively at the section level of the patent texts, and by 9.5 percentage points and 11.2 percentage points respectively at the class level. Experimental results show that ALBERT-BiGRU can effectively improve the classification of patent texts at different hierarchy levels.

Key words: patent text, text classification, A Lite BERT (ALBERT), Bidirectional Gated Recurrent Unit (BiGRU), word vector
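The architecture described in the abstract can be sketched as follows: contextual (dynamic) token vectors produced by ALBERT are fed to a bidirectional GRU, and the concatenated final forward/backward hidden states are mapped to patent-category scores. This is a minimal illustrative sketch in PyTorch, not the authors' implementation; all layer sizes, the class count, and the random tensor standing in for ALBERT's output are assumptions.

```python
import torch
import torch.nn as nn

class ALBertBiGRUClassifier(nn.Module):
    """Hypothetical sketch of an ALBERT-BiGRU-style classifier."""

    def __init__(self, albert_hidden=768, gru_hidden=128, num_classes=8):
        super().__init__()
        # BiGRU reads the contextual vectors in both directions, preserving
        # semantic associations between long-distance words
        self.bigru = nn.GRU(albert_hidden, gru_hidden,
                            batch_first=True, bidirectional=True)
        # concatenated forward/backward final states -> class scores
        self.fc = nn.Linear(2 * gru_hidden, num_classes)

    def forward(self, token_vectors):
        # token_vectors: (batch, seq_len, albert_hidden),
        # e.g. the last hidden states of a pre-trained ALBERT encoder
        _, h_n = self.bigru(token_vectors)       # h_n: (2, batch, gru_hidden)
        h = torch.cat([h_n[0], h_n[1]], dim=-1)  # (batch, 2 * gru_hidden)
        return self.fc(h)

model = ALBertBiGRUClassifier()
# stand-in for ALBERT contextual embeddings of a 4-document batch
dummy = torch.randn(4, 32, 768)
logits = model(dummy)
print(logits.shape)  # torch.Size([4, 8])
```

In practice the `dummy` tensor would be replaced by the output of a pre-trained ALBERT model run over the tokenized patent text, and `num_classes` would match the number of categories at the chosen level of the patent hierarchy.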

