Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (11): 3379-3385. DOI: 10.11772/j.issn.1001-9081.2021112005

• CCF Bigdata 2021

Detection of unsupervised offensive speech based on multilingual BERT

Xiayang SHI¹, Fengyuan ZHANG¹, Jiaqi YUAN², Min HUANG¹

  1. College of Software Engineering, Zhengzhou University of Light Industry, Zhengzhou Henan 450001, China
  2. College of Mathematics and Information Science, Zhengzhou University of Light Industry, Zhengzhou Henan 450001, China
  • Received: 2021-11-25 Revised: 2021-12-31 Accepted: 2022-01-14 Online: 2022-01-19 Published: 2022-11-10
  • Contact: Min HUANG (huangmin@zzuli.edu.cn)
  • About author: SHI Xiayang, born in 1978, native of Lushan, Henan, Ph. D., lecturer, CCF member. His research interests include natural language processing and machine translation.
    ZHANG Fengyuan, born in 1998, native of Xuchang, Henan. Her research interests include natural language processing and machine translation.
    YUAN Jiaqi, born in 1996, native of Xuchang, Henan, M. S. candidate. Her research interests include natural language processing and multimodal machine translation.
    HUANG Min, born in 1972, native of Nanyang, Henan, Ph. D., professor. His research interests include data mining and information processing.
  • Supported by:
    Key Research and Development and Promotion Project of Henan Province (212102210547)

Abstract:

Offensive speech has a serious negative impact on social stability. Automatic detection of offensive speech currently focuses on a few high-resource languages, and the lack of sufficient annotated offensive speech corpora makes detection difficult in low-resource languages. To solve this problem, a cross-language unsupervised offensiveness transfer detection method was proposed. Firstly, the multilingual BERT (multilingual Bidirectional Encoder Representations from Transformers, mBERT) model was used to learn offensive features on a high-resource English dataset, yielding an original model. Then, by analyzing the language similarity between English and Danish, Arabic, Turkish and Greek, the original model was transferred to these four low-resource languages to detect offensive speech in them automatically. Experimental results show that, compared with the four methods BERT, Linear Regression (LR), Support Vector Machine (SVM) and Multi-Layer Perceptron (MLP), the proposed method increases both the accuracy and the F1 score of offensive speech detection in Danish, Arabic, Turkish and Greek by nearly 2 percentage points, approaching the results of current supervised detection. This shows that combining cross-language model transfer learning with transfer detection can achieve unsupervised offensiveness detection for low-resource languages.

Key words: cross-language model, offensive speech detection, BERT (Bidirectional Encoder Representations from Transformers), unsupervised method, Transfer Learning (TL)
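
The abstract reports its results as gains in accuracy and F1 score, measured in percentage points. As a reading aid, the following minimal sketch shows how such figures could be computed for a binary offensive/non-offensive task. It is not the authors' evaluation code: the function names (`accuracy`, `f1_positive`, `percentage_point_gain`), the label convention (1 = offensive) and the toy predictions are all assumptions made for illustration, and the paper may use a different F1 variant (for example, macro-averaged F1).

```python
from typing import Sequence


def _check(gold: Sequence[int], pred: Sequence[int]) -> None:
    """Reject empty or mismatched label sequences."""
    if len(gold) == 0 or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and equally long")


def accuracy(gold: Sequence[int], pred: Sequence[int]) -> float:
    """Fraction of examples whose predicted label equals the gold label."""
    _check(gold, pred)
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f1_positive(gold: Sequence[int], pred: Sequence[int], positive: int = 1) -> float:
    """F1 score of the positive (assumed: offensive) class."""
    _check(gold, pred)
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0  # no true positives: F1 is defined as 0 here
    return 2 * tp / (2 * tp + fp + fn)


def percentage_point_gain(new: float, old: float) -> float:
    """Difference between two rates in [0, 1], expressed in percentage points."""
    return (new - old) * 100


# Hypothetical labels for eight target-language posts (1 = offensive).
gold = [1, 0, 1, 1, 0, 0, 1, 0]
baseline_pred = [1, 0, 0, 1, 0, 1, 0, 0]   # invented baseline output
transfer_pred = [1, 0, 1, 1, 0, 1, 0, 0]   # invented transferred-model output

acc_gain = percentage_point_gain(
    accuracy(gold, transfer_pred), accuracy(gold, baseline_pred)
)
f1_gain = percentage_point_gain(
    f1_positive(gold, transfer_pred), f1_positive(gold, baseline_pred)
)
print(f"accuracy gain: {acc_gain:.1f} points, F1 gain: {f1_gain:.1f} points")
```

Expressing the difference in percentage points rather than as a relative percentage matches the wording of the abstract: a gain of 2 percentage points means, for instance, moving from 0.80 to 0.82, whatever the baseline level.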

CLC Number: