Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (1): 57-63. DOI: 10.11772/j.issn.1001-9081.2021020366

• Artificial intelligence •

Text multi-label classification method incorporating BERT and label semantic attention

Xueqiang LYU, Chen PENG, Le ZHANG, Zhi’an DONG, Xindong YOU

  1. Beijing Key Laboratory of Internet Culture and Digital Dissemination Research (Beijing Information Science and Technology University), Beijing 100101, China
  • Received: 2021-03-11 Revised: 2021-04-28 Accepted: 2021-04-29 Online: 2021-05-21 Published: 2022-01-10
  • Contact: Le ZHANG
  • About author: LYU Xueqiang, born in 1970, Ph.D., professor. His research interests include Chinese and multimedia information processing.
    PENG Chen, born in 1996, M.S. candidate. His research interests include natural language processing.
    ZHANG Le, born in 1988, Ph.D., associate professor. Her research interests include natural language processing and Web user behavior analysis.
    DONG Zhi’an, born in 1989, M.S., research fellow. His research interests include natural language processing.
    YOU Xindong, born in 1979, Ph.D., associate professor. Her research interests include natural language processing, text mining, and data classification.
  • Supported by:
    Natural Science Foundation of Beijing (4212020); Tibetan Information Processing and Machine Translation Key Laboratory of Qinghai Province / Open Project Fund of Key Laboratory of Tibetan Information Processing of Ministry of Education (2019Z002)

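Text multi-label classification method incorporating BERT and label semantic attention

Xueqiang LYU, Chen PENG, Le ZHANG, Zhi’an DONG, Xindong YOU

  1. Beijing Key Laboratory of Internet Culture and Digital Dissemination Research (Beijing Information Science and Technology University), Beijing 100101
  • Corresponding author: Le ZHANG
  • About author: LYU Xueqiang (born 1970), male, from Yutai, Shandong; professor, Ph.D., CCF member. Main research interests: Chinese and multimedia information processing.
    PENG Chen (born 1996), male, from Huangshi, Hubei; M.S. candidate. Main research interests: natural language processing.
    ZHANG Le (born 1988), female, from Shijiazhuang, Hebei; associate professor, Ph.D. Main research interests: natural language processing and Web user behavior analysis.
    DONG Zhi’an (born 1989), male, from Fushun, Liaoning; research fellow, M.S. Main research interests: natural language processing.
    YOU Xindong (born 1979), female, from Yongding, Fujian; associate professor, Ph.D., CCF member. Main research interests: natural language processing, text mining, and data classification.
  • Supported by:
    Natural Science Foundation of Beijing (4212020); Open Project Fund of the Tibetan Information Processing and Machine Translation Key Laboratory of Qinghai Province / Key Laboratory of Tibetan Information Processing of Ministry of Education (2019Z002)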

Abstract:

Multi-Label Text Classification (MLTC) is an important subtask in the field of Natural Language Processing (NLP). To address the complex correlations among multiple labels, an MLTC method named TLA-BERT was proposed, incorporating Bidirectional Encoder Representations from Transformers (BERT) and label semantic attention. Firstly, the contextual vector representation of the input text was learned by fine-tuning the autoencoding pre-trained model. Secondly, each label was encoded individually with a Long Short-Term Memory (LSTM) neural network. Finally, an attention mechanism was used to explicitly highlight the contribution of the text to each label in order to predict the multi-label sequence. Experimental results show that, compared with the Sequence Generation Model (SGM) algorithm, the proposed method improves the F1 value by 2.8 percentage points on the arXiv Academic Paper Dataset (AAPD) and by 1.5 percentage points on the Reuters Corpus Volume I (RCV1)-v2 public dataset.

Key words: multi-label classification, Bidirectional Encoder Representations from Transformers (BERT), label semantic information, Bidirectional Long Short-Term Memory (BiLSTM) neural network, attention mechanism
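The third step of the abstract, attention between label semantics and the text, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`label_attention`, `softmax`, `dot`) are hypothetical, the scoring function is assumed to be a plain dot product, and the toy vectors merely stand in for BERT token outputs and LSTM-encoded labels.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def label_attention(token_vecs, label_vecs):
    """For each label, attend over the tokens of one text.

    token_vecs: list of token vectors (stand-in for BERT outputs).
    label_vecs: list of label vectors (stand-in for LSTM-encoded labels).
    Returns (weights, label_docs): one attention distribution over tokens
    and one label-specific text vector per label.
    Assumption: dot-product scoring; the paper may use a different form.
    """
    dim = len(token_vecs[0])
    weights, label_docs = [], []
    for label in label_vecs:
        scores = [dot(label, tok) for tok in token_vecs]
        alpha = softmax(scores)  # how much each token contributes to this label
        doc = [sum(a * tok[j] for a, tok in zip(alpha, token_vecs))
               for j in range(dim)]  # attention-weighted sum of token vectors
        weights.append(alpha)
        label_docs.append(doc)
    return weights, label_docs

# Toy example: three tokens, two labels, two-dimensional vectors.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
labels = [[2.0, 0.0], [0.0, 2.0]]
weights, label_docs = label_attention(tokens, labels)
```

In this toy input, the first label puts most of its weight on the first and third tokens and the second label on the second token, which is the sense in which the attention makes each text segment's contribution to each label explicit. In a full model, each label-specific vector would then feed a per-label classifier.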

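Abstract:

Multi-Label Text Classification (MLTC) is one of the important subtopics in the field of Natural Language Processing (NLP). To address the complex correlations among multiple labels, an MLTC method named TLA-BERT, which incorporates BERT and label semantic attention, is proposed. First, the contextual vector representation of the input text is learned by fine-tuning the autoencoding pre-trained model. Then, each label is encoded individually with a Long Short-Term Memory (LSTM) neural network. Finally, an attention mechanism is used to explicitly highlight the contribution of the text to each label in order to predict the multi-label sequence. Experimental results show that, compared with the algorithm based on the Sequence Generation Model (SGM), the proposed method improves the F1 value by 2.8 percentage points on the AAPD public dataset and by 1.5 percentage points on the RCV1-v2 public dataset.

Key words: multi-label classification, BERT, label semantic information, bidirectional long short-term memory neural network, attention mechanism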
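The reported gains are stated in F1 percentage points, that is, absolute differences between two F1 scores expressed in percent. The sketch below shows one common way to compute F1 for multi-label predictions. It assumes micro-averaging, which the abstract does not specify, and the function name `micro_f1` and the toy labels are hypothetical.

```python
def micro_f1(true_sets, pred_sets):
    """Micro-averaged F1 over a list of documents.

    true_sets / pred_sets: one set of labels per document.
    Assumption: micro-averaging; the abstract only says "F1 value".
    """
    tp = fp = fn = 0
    for truth, pred in zip(true_sets, pred_sets):
        tp += len(truth & pred)   # labels predicted and correct
        fp += len(pred - truth)   # labels predicted but wrong
        fn += len(truth - pred)   # labels missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example with made-up labels, not taken from AAPD or RCV1-v2.
gold = [{"A", "B"}, {"C"}]
pred = [{"A"}, {"C", "B"}]
score = micro_f1(gold, pred)
```

Here there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and the micro-F1 is 2/3.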
