Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (5): 1317-1323. DOI: 10.11772/j.issn.1001-9081.2021030489

• Artificial Intelligence •


Text classification model combining word annotations

Xianfeng YANG1, Jiahe ZHAO1, Ziqiang LI2

  1. School of Computer Science, Southwest Petroleum University, Chengdu Sichuan 610500, China
    2. College of Movie and Media, Sichuan Normal University, Chengdu Sichuan 610066, China
  • Received: 2021-03-31 Revised: 2021-07-08 Accepted: 2021-07-21 Online: 2022-06-11 Published: 2022-05-10
  • Contact: Xianfeng YANG (565695835@qq.com)
  • About author: YANG Xianfeng, born in 1974, M.S., professor. Her research interests include computer image processing and smart education.
    ZHAO Jiahe, born in 1997, M.S. candidate. His research interests include natural language processing.
    LI Ziqiang, born in 1970, Ph.D., professor, CCF member. His research interests include machine learning and smart education.
  • Supported by:
    National Natural Science Foundation of China (61802321); Key Research and Development Program of Science and Technology Department of Sichuan Province (2020YFN0019)


Abstract:

Traditional text feature representation methods cannot fully solve the polysemy problem of words. To address this problem, a text classification model combining word annotations was proposed. Firstly, using an existing Chinese dictionary, dictionary annotations were selected for the words of a text according to their contexts, and Bidirectional Encoder Representations from Transformers (BERT) encoding was performed on the annotations to generate annotation sentence vectors. Then, the annotation sentence vectors were fused with the word embedding vectors as the input layer to enrich the feature information of the input text. Finally, a Bidirectional Gated Recurrent Unit (BiGRU) was used to learn the feature information of the text, and an attention mechanism was introduced to highlight the key feature vectors. Experimental results of text classification on the public THUCNews dataset and the Sina Weibo sentiment classification dataset show that the text classification models combining BERT word annotations significantly outperform their counterparts without word annotations. Among all the experimental models, the proposed BERT word annotation_BiGRU_Attention model achieves the highest precision and recall, and its F1-scores, which reflect the overall performance, reach 98.16% and 96.52% on the two datasets respectively.
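To make the described pipeline concrete, below is a minimal PyTorch sketch of the architecture outlined in the abstract: BERT-encoded annotation sentence vectors are fused (here by concatenation) with character embeddings and fed to a BiGRU with additive attention. This is not the authors' released code; the names `encode_annotation` and `AnnotationFusionClassifier`, the use of the [CLS] hidden state as the sentence vector, the concatenation fusion, the additive attention form, and all dimensions are illustrative assumptions.

```python
# Sketch of the BERT word-annotation + BiGRU + attention model; all names
# and hyperparameters are assumptions made for illustration.
import torch
import torch.nn as nn

def encode_annotation(annotation: str) -> torch.Tensor:
    """Encode one dictionary annotation into a sentence vector with BERT.

    Assumes the Hugging Face `transformers` package and the public
    `bert-base-chinese` checkpoint; the [CLS] hidden state is taken as the
    sentence vector (one plausible choice, not specified by the paper).
    """
    from transformers import BertTokenizer, BertModel
    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    bert = BertModel.from_pretrained("bert-base-chinese")
    inputs = tokenizer(annotation, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = bert(**inputs)
    return out.last_hidden_state[:, 0]  # (1, 768) [CLS] vector

class AnnotationFusionClassifier(nn.Module):
    def __init__(self, vocab_size, char_dim=128, bert_dim=768,
                 hidden_dim=256, num_classes=10):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, char_dim)  # character embeddings
        self.bigru = nn.GRU(char_dim + bert_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        # Additive attention over the BiGRU hidden states.
        self.att_proj = nn.Linear(2 * hidden_dim, 2 * hidden_dim)
        self.att_vec = nn.Linear(2 * hidden_dim, 1, bias=False)
        self.fc = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, char_ids, annot_vecs):
        # char_ids:   (batch, seq_len)            character indices
        # annot_vecs: (batch, seq_len, bert_dim)  per-character annotation
        #             sentence vectors, precomputed with encode_annotation()
        x = torch.cat([self.char_emb(char_ids), annot_vecs], dim=-1)
        h, _ = self.bigru(x)                                 # (batch, seq, 2*hidden)
        scores = self.att_vec(torch.tanh(self.att_proj(h)))  # (batch, seq, 1)
        alpha = torch.softmax(scores, dim=1)                 # attention weights
        context = (alpha * h).sum(dim=1)                     # weighted sum
        return self.fc(context)

# Smoke test with random tensors standing in for real inputs.
model = AnnotationFusionClassifier(vocab_size=5000)
logits = model(torch.randint(0, 5000, (2, 32)), torch.randn(2, 32, 768))
print(logits.shape)  # torch.Size([2, 10])
```

Since the dictionary is fixed, the annotation sentence vectors can be precomputed offline once per dictionary entry, so BERT need not run at every training step.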

Key words: polysemy, word annotation, Bidirectional Encoder Representations from Transformers (BERT), Bidirectional Gated Recurrent Unit (BiGRU), attention mechanism, text classification

CLC number: