Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (11): 3379-3385. DOI: 10.11772/j.issn.1001-9081.2021112005
Special Issue: The 9th CCF Conference on Big Data (CCF Bigdata 2021)
Xiayang SHI1, Fengyuan ZHANG1, Jiaqi YUAN2, Min HUANG1
Received: 2021-11-25
Revised: 2021-12-31
Accepted: 2022-01-14
Online: 2022-01-19
Published: 2022-11-10
Contact: Min HUANG
About author: SHI Xiayang, born in 1978 in Lushan, Henan, Ph.D., lecturer, CCF member. His research interests include natural language processing and machine translation.
Xiayang SHI, Fengyuan ZHANG, Jiaqi YUAN, Min HUANG. Detection of unsupervised offensive speech based on multilingual BERT[J]. Journal of Computer Applications, 2022, 42(11): 3379-3385.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2021112005
Language | Sample data |
---|---|
English | And this from the clown that should be in prison? |
Danish | Og det fra klovnen, der burde være i fængsel? |
Arabic | وهذا من المهرج الذي يجب أن يكون في السجن؟ |
Turkish | Ve bu da hapiste olması gereken palyaçodan mı? |
Greek | Και αυτό από τον κλόουν που θα έπρεπε να είναι στη φυλακή; |
Tab. 1 Sample data for various languages
Language | Training set: offensive | Training set: non-offensive | Training set: total | Test set: offensive | Test set: non-offensive | Test set: total | Training + test total |
---|---|---|---|---|---|---|---|
English | 4 000 | 7 916 | 11 916 | 400 | 924 | 1 324 | 13 240 |
Danish | 344 | 2 320 | 2 664 | 40 | 256 | 296 | 2 960 |
Arabic | 1 395 | 5 660 | 7 055 | 155 | 629 | 784 | 7 839 |
Turkish | 5 441 | 22 708 | 28 149 | 605 | 2 523 | 3 128 | 31 277 |
Greek | 2 238 | 5 631 | 7 869 | 248 | 626 | 874 | 8 743 |
Tab. 2 Sample data distribution
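The totals in Tab. 2 can be checked, and the class imbalance made explicit, with a short script. This is only a minimal sketch: the counts are copied from the table, while the dictionary layout and the helper names `offensive_share` and `totals_consistent` are illustrative assumptions rather than anything defined in the paper.

```python
# Counts copied from Tab. 2, per language:
# (train offensive, train non-offensive, test offensive, test non-offensive, reported total)
TAB2 = {
    "English": (4000, 7916, 400, 924, 13240),
    "Danish": (344, 2320, 40, 256, 2960),
    "Arabic": (1395, 5660, 155, 629, 7839),
    "Turkish": (5441, 22708, 605, 2523, 31277),
    "Greek": (2238, 5631, 248, 626, 8743),
}


def offensive_share(language: str) -> float:
    """Fraction of offensive samples over train + test (hypothetical helper)."""
    tr_off, tr_non, te_off, te_non, _ = TAB2[language]
    return (tr_off + te_off) / (tr_off + tr_non + te_off + te_non)


def totals_consistent() -> bool:
    """True if the four counts in every row add up to the reported total."""
    return all(sum(row[:4]) == row[4] for row in TAB2.values())


if __name__ == "__main__":
    print("totals consistent:", totals_consistent())
    for lang in TAB2:
        print(f"{lang}: {offensive_share(lang):.1%} offensive")
```

Every language is skewed toward the non-offensive class, which is one reason accuracy alone can be misleading on this data and F1 is reported alongside it.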
Target language | Model | Accuracy | F1 |
---|---|---|---|
Danish | Proposed method | 0.796 | 0.619 |
Danish | BERT | 0.602 | 0.412 |
Danish | LR | 0.563 | 0.407 |
Danish | SVM | 0.615 | 0.444 |
Danish | MLP | 0.592 | 0.441 |
Arabic | Proposed method | 0.764 | 0.508 |
Arabic | BERT | 0.723 | 0.443 |
Arabic | LR | 0.651 | 0.220 |
Arabic | SVM | 0.735 | 0.499 |
Arabic | MLP | 0.672 | 0.478 |
Turkish | Proposed method | 0.730 | 0.553 |
Turkish | BERT | 0.569 | 0.397 |
Turkish | LR | 0.521 | 0.336 |
Turkish | SVM | 0.602 | 0.449 |
Turkish | MLP | 0.551 | 0.401 |
Greek | Proposed method | 0.690 | 0.525 |
Greek | BERT | 0.580 | 0.418 |
Greek | LR | 0.535 | 0.378 |
Greek | SVM | 0.596 | 0.445 |
Greek | MLP | 0.554 | 0.368 |
Tab. 3 Comparison of experimental results of different methods
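For readers unfamiliar with the two metrics in Tab. 3, the sketch below shows how accuracy and F1 are computed from gold and predicted labels. It is an illustration under stated assumptions: this page does not say whether the reported F1 is for the offensive class or macro-averaged over both classes, so both variants are shown, and the function names and the toy labels (1 = offensive, 0 = non-offensive) are hypothetical.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Share of predictions that match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true: Sequence[int], y_pred: Sequence[int], positive: int) -> float:
    """F1 = 2TP / (2TP + FP + FN), treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in labels) / len(labels)


if __name__ == "__main__":
    # Toy labels, not data from the paper.
    gold = [1, 1, 0, 0, 0, 0]
    pred = [1, 0, 0, 0, 0, 1]
    print(accuracy(gold, pred), f1_for_class(gold, pred, 1), macro_f1(gold, pred))
```

On imbalanced data such as Tab. 2, a classifier can reach a high accuracy by favoring the majority class while its F1 on the offensive class stays low, which is consistent with the gap between the two columns in Tab. 3.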
Language | Danish | Arabic | Turkish | Greek |
---|---|---|---|---|
English | 0.31 | 0.38 | 0.39 | 0.47 |
Greek | 0.25 | 0.36 | 0.29 | |
Turkish | 0.22 | 0.20 | 0.39 |
Tab. 4 Gromov-Hausdorff (GH) distance between languages
Method | Danish Accuracy | Danish F1 | Arabic Accuracy | Arabic F1 | Turkish Accuracy | Turkish F1 | Greek Accuracy | Greek F1 |
---|---|---|---|---|---|---|---|---|
Supervised method | 0.825 | 0.709 | 0.801 | 0.692 | 0.768 | 0.672 | 0.791 | 0.683 |
Proposed method (unsupervised) | 0.796 | 0.619 | 0.764 | 0.508 | 0.730 | 0.553 | 0.690 | 0.525 |
Tab. 5 Comparison of supervised method and proposed unsupervised method
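The size of the gap between the supervised baseline and the proposed unsupervised method in Tab. 5 can be read off with a few lines of arithmetic. The sketch below only subtracts the values reported in the table; the dictionary layout and the helper name `gap` are illustrative assumptions.

```python
# (accuracy, F1) pairs copied from Tab. 5.
SUPERVISED = {
    "Danish": (0.825, 0.709),
    "Arabic": (0.801, 0.692),
    "Turkish": (0.768, 0.672),
    "Greek": (0.791, 0.683),
}
UNSUPERVISED = {
    "Danish": (0.796, 0.619),
    "Arabic": (0.764, 0.508),
    "Turkish": (0.730, 0.553),
    "Greek": (0.690, 0.525),
}


def gap(language: str) -> tuple[float, float]:
    """Supervised minus unsupervised, as (accuracy gap, F1 gap), rounded to 3 places."""
    sup_acc, sup_f1 = SUPERVISED[language]
    uns_acc, uns_f1 = UNSUPERVISED[language]
    return round(sup_acc - uns_acc, 3), round(sup_f1 - uns_f1, 3)


if __name__ == "__main__":
    for lang in SUPERVISED:
        acc_gap, f1_gap = gap(lang)
        print(f"{lang}: accuracy gap {acc_gap:.3f}, F1 gap {f1_gap:.3f}")
```

By these figures the accuracy gap is smallest for Danish and largest for Greek, while the F1 gap is largest for Arabic.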