Journal of Computer Applications (《计算机应用》) ›› 2022, Vol. 42 ›› Issue (4): 1116-1124. DOI: 10.11772/j.issn.1001-9081.2021071257

• The 36th CCF China Computer Application Conference (CCF NCCA 2021) •

News topic text classification method combining BERT and feature projection network

Haifeng ZHANG1, Cheng ZENG1,2,3(), Lie PAN1, Rusong HAO1, Chaodong WEN1, Peng HE1,2,3

  1. School of Computer Science and Information Engineering, Hubei University, Wuhan 430062, China
    2. Hubei Province Engineering Technology Research Center for Software Engineering, Wuhan 430062, China
    3. Hubei Province Engineering Research Center for Smart Government and Artificial Intelligence Applications, Wuhan 430062, China
  • Received: 2021-07-16 Revised: 2021-11-11 Accepted: 2021-11-17 Online: 2022-04-15 Published: 2022-04-10
  • Corresponding author: Cheng ZENG
  • About authors: ZHANG Haifeng, born in 1990 in Huanggang, Hubei, M.S. candidate. His research interests include natural language processing and text classification.
    PAN Lie, born in 1997 in Huanggang, Hubei, M.S. candidate. His research interests include natural language processing and text classification.
    HAO Rusong, born in 1996 in Kaifeng, Henan, M.S. candidate. His research interests include natural language processing and text classification.
    WEN Chaodong, born in 1996 in Jingzhou, Hubei, M.S. candidate, CCF member. His research interests include natural language processing and text classification.
    HE Peng, born in 1988 in Wuhan, Hubei, Ph.D., professor. His research interests include artificial intelligence and recommender systems.
  • Supported by:
    National Natural Science Foundation of China (61977021)

News topic text classification method based on BERT and feature projection network

Haifeng ZHANG1, Cheng ZENG1,2,3(), Lie PAN1, Rusong HAO1, Chaodong WEN1, Peng HE1,2,3   

  1. School of Computer Science and Information Engineering, Hubei University, Wuhan, Hubei 430062, China
    2. Engineering and Technical Research Center of Hubei Province in Software Engineering, Wuhan, Hubei 430062, China
    3. Engineering Research Center of Hubei Province in Intelligent Government Affairs and Application of Artificial Intelligence, Wuhan, Hubei 430062, China
  • Received:2021-07-16 Revised:2021-11-11 Accepted:2021-11-17 Online:2022-04-15 Published:2022-04-10
  • Contact: Cheng ZENG
  • About authors: ZHANG Haifeng, born in 1990, M.S. candidate. His research interests include natural language processing and text classification.
    PAN Lie, born in 1997, M.S. candidate. His research interests include natural language processing and text classification.
    HAO Rusong, born in 1996, M.S. candidate. His research interests include natural language processing and text classification.
    WEN Chaodong, born in 1996, M.S. candidate. His research interests include natural language processing and text classification.
    HE Peng, born in 1988, Ph.D., professor. His research interests include artificial intelligence and recommender systems.
  • Supported by:
    National Natural Science Foundation of China(61977021)

Abstract:

Aiming at the problems of non-standard wording, fuzzy semantics and sparse features in news topic text, a news topic text classification method combining BERT and the Feature Projection network (FPnet) was proposed. The method has two implementations. In mode 1, multi-layer fully connected feature extraction is performed on the output of the BERT model for news topic text, and the finally extracted text features are purified with the feature projection method, thereby strengthening the classification effect. In mode 2, the feature projection network is fused into the hidden layers of the BERT model, so that the classification features are strengthened and purified through hidden-layer feature projection. Experiments on the Toutiao, Sohu News, THUCNews-L and THUCNews-S datasets show that both modes outperform the baseline BERT method in accuracy and macro-averaged F1 score, with the highest accuracies of 86.96%, 86.17%, 94.40% and 93.73% respectively, verifying the feasibility and effectiveness of the proposed method.

Key words: pre-trained language model, text classification, news topic, BERT, feature projection network

Abstract:

Concerning the problems of non-standard wording, fuzzy semantics and feature sparsity in news topic text, a news topic text classification method based on Bidirectional Encoder Representations from Transformers (BERT) and Feature Projection network (FPnet) was proposed. The method includes two implementation modes. In mode 1, multi-layer fully connected feature extraction was performed on the output of the BERT model for news topic text, and the finally extracted text features were purified by combining them with the feature projection method, thereby strengthening the classification effect. In mode 2, the feature projection network was fused into the hidden layers of the BERT model for feature projection, so that the classification features were strengthened and purified through hidden-layer feature projection. Experimental results on the Toutiao, Sohu News, THUCNews-L and THUCNews-S datasets show that the two above modes have better performance in accuracy and macro-averaged F1 score than the baseline BERT method, with the highest accuracies of 86.96%, 86.17%, 94.40% and 93.73% respectively, which proves the feasibility and effectiveness of the proposed method.
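The purification step described in the abstract follows the orthogonal feature projection idea underlying FPnet: the main feature vector is stripped of its component along a learned "common" (category-invariant) feature vector, so that only class-discriminative information remains. A minimal sketch of that vector operation, using plain Python lists as feature vectors (the function names are illustrative, not taken from the authors' code):

```python
def project(f, c):
    """Project feature vector f onto common-feature vector c: (f.c / c.c) * c."""
    dot_fc = sum(a * b for a, b in zip(f, c))
    dot_cc = sum(a * a for a in c)
    return [dot_fc / dot_cc * a for a in c]

def purify(f, c):
    """Orthogonal feature projection: remove from f its component along c,
    keeping only the part of f that c (the common feature) cannot explain."""
    return [a - b for a, b in zip(f, project(f, c))]
```

After purification the result is orthogonal to c, so whatever information the common-feature vector captures no longer contributes to the classification decision.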

Key words: pre-trained language model, text classification, news topic, Bidirectional Encoder Representations from Transformers (BERT), Feature Projection network (FPnet)

CLC number: