Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (1): 139-144. DOI: 10.11772/j.issn.1001-9081.2020061066

Special topic: The 8th China Conference on Data Mining (CCDM 2020)


Sentiment classification of incomplete data based on bidirectional encoder representations from transformers

LUO Jun1,2, CHEN Lifei1,2   

  1. College of Mathematics and Informatics, Fujian Normal University, Fuzhou, Fujian 350117, China;
  2. Digital Fujian Internet-of-Things Laboratory of Environmental Monitoring (Fujian Normal University), Fuzhou, Fujian 350117, China
  • Received: 2020-05-31  Revised: 2020-08-03  Online: 2021-01-10  Published: 2020-11-12
  • Corresponding author: LUO Jun
  • About the authors: LUO Jun (1995-), male, born in Nanchang, Jiangxi, M.S. candidate. His research interests include data mining and natural language processing. CHEN Lifei (1972-), male, born in Changle, Fujian, Ph.D., professor. His research interests include statistical machine learning, data mining, and pattern recognition.
  • Supported by:
    This work is partially supported by the Natural Science Foundation of Fujian Province (2015J01238) and the Innovation Team Project of Fujian Normal University (IRTL1704).


Abstract: Incomplete data, such as interactive information on social platforms and review contents in Internet movie databases, are widespread in real life. However, most existing sentiment classification models are built on complete datasets and do not consider the impact of incomplete data on classification performance. To address this problem, a stacked denoising neural network model based on BERT (Bidirectional Encoder Representations from Transformers) was proposed for sentiment classification of incomplete data. The model was composed of two components: a Stacked Denoising AutoEncoder (SDAE) and BERT. Firstly, the incomplete data processed by word embedding were fed into the SDAE for denoising training, extracting deep features to reconstruct the feature representations of missing and erroneous words. Then, the resulting output was passed into the pre-trained BERT model for refinement, further improving the feature vector representations of the words. Experimental results on two commonly used sentiment datasets show that the proposed method improves the F1 score and classification accuracy on incomplete data by about 6% and 5%, respectively, verifying the effectiveness of the proposed model.
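The SDAE component described above can be illustrated with a minimal single-layer denoising autoencoder in NumPy: random entries of the input embeddings are zeroed out to mimic missing or wrong words, and the network is trained to reconstruct the clean input. The dimensions, corruption rate, tied weights, and training schedule below are illustrative assumptions for a sketch, not the configuration used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class DenoisingAutoencoder:
    """One tied-weight denoising autoencoder layer: corrupt the input,
    encode it, and learn to reconstruct the clean original."""

    def __init__(self, n_in, n_hidden, lr=0.5, corruption=0.3):
        self.W = rng.normal(0.0, 0.1, (n_in, n_hidden))  # shared encoder/decoder weights
        self.b = np.zeros(n_hidden)   # encoder bias
        self.c = np.zeros(n_in)       # decoder bias
        self.lr = lr
        self.corruption = corruption

    def corrupt(self, x):
        # Zero out random entries, mimicking missing or erroneous words.
        return x * (rng.random(x.shape) > self.corruption)

    def reconstruct(self, x_noisy):
        h = sigmoid(x_noisy @ self.W + self.b)       # encode
        return h, sigmoid(h @ self.W.T + self.c)     # decode with tied weights

    def train_step(self, x):
        x_noisy = self.corrupt(x)
        h, x_rec = self.reconstruct(x_noisy)
        err = x_rec - x                              # compare with the *clean* input
        # Backpropagate through the sigmoid decoder and encoder.
        d_rec = err * x_rec * (1.0 - x_rec)
        d_h = (d_rec @ self.W) * h * (1.0 - h)
        n = len(x)
        self.W -= self.lr * (x_noisy.T @ d_h + d_rec.T @ h) / n
        self.b -= self.lr * d_h.sum(axis=0) / n
        self.c -= self.lr * d_rec.sum(axis=0) / n
        return float(0.5 * (err ** 2).sum() / n)     # mean reconstruction loss

# Toy "word embeddings" with low-rank structure so denoising is learnable.
X = sigmoid(rng.normal(size=(200, 3)) @ rng.normal(size=(3, 20)))
dae = DenoisingAutoencoder(n_in=20, n_hidden=10)
losses = [dae.train_step(X) for _ in range(300)]
```

In the full model, the hidden representations produced by stacked layers of this kind would stand in for the reconstructed word features before being refined by BERT; here the reconstruction loss simply decreases over training on the toy data.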

Key words: incomplete data, sentiment classification, BERT (Bidirectional Encoder Representations from Transformers), Stacked Denoising AutoEncoder (SDAE), pre-trained model

CLC number: