Image and text multi-modality fusion classification model based on BERT

doi:10.11772/j.issn.1001-9081.2022091362

Abstract:

In clinical diagnosis, doctors combine medical images with pathology report text to judge a patient's condition. Existing Artificial Intelligence (AI) aided diagnosis systems do not make full use of the textual examination content. To address this problem, an Image-Text Multi-Modality classification model based on BERT (ITMMB) was proposed, which fuses and classifies medical images and pathology texts at the feature level. A Residual Network (ResNet) pre-processes each medical image to obtain image word-embedding vectors, while word segmentation processes the pathology text to obtain text word-embedding vectors; both types of embedding vectors are then fed into the BERT model for final classification. In addition, the residual module, learning weights, loss function, and pooling layer of ResNet were optimized to meet the input requirements of BERT and to achieve better classification performance. Experimental results on the Open Images dataset show that, compared with aided diagnosis models that use only medical images or only pathology texts, ITMMB improves the micro-average F1 score by 38.76 and 4.66 percentage points, respectively, and can effectively assist doctors in clinical diagnosis.

Key words: multi-modality fusion, Residual Network (ResNet), image classification, text classification, feature extraction, BERT (Bidirectional Encoder Representations from Transformers)
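The feature-level fusion described in the abstract can be illustrated with a minimal PyTorch sketch. It is not the authors' implementation: the small convolutional stack stands in for the optimized ResNet, a two-layer `nn.TransformerEncoder` stands in for BERT, position embeddings are omitted, and the class name `FusionSketch`, all layer sizes, the number of image tokens, and the 14-label multi-label output are assumptions made only for illustration.

```python
import torch
import torch.nn as nn


class FusionSketch(nn.Module):
    """Hypothetical sketch: image tokens and text tokens share one encoder."""

    def __init__(self, vocab_size=1000, hidden=64, num_image_tokens=3, num_labels=14):
        super().__init__()
        # Stand-in for the ResNet backbone (assumption: any CNN that yields a feature map).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 128, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        # Pool the feature map into a fixed number of regions, one "image word" each.
        self.pool = nn.AdaptiveAvgPool2d((num_image_tokens, 1))
        # Project image features to the same width as the text embeddings.
        self.image_proj = nn.Linear(128, hidden)
        self.text_embed = nn.Embedding(vocab_size, hidden)
        # Segment embedding tells the encoder which modality a token came from.
        self.segment_embed = nn.Embedding(2, hidden)  # 0 = text, 1 = image
        # Stand-in for BERT (assumption): a small Transformer encoder.
        layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=4, dim_feedforward=128, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, images, token_ids):
        feats = self.backbone(images)                          # (B, 128, H', W')
        regions = self.pool(feats).flatten(2).transpose(1, 2)  # (B, N_img, 128)
        img_tokens = self.image_proj(regions)                  # (B, N_img, hidden)
        txt_tokens = self.text_embed(token_ids)                # (B, N_txt, hidden)

        # Fusion step: concatenate both modalities into one token sequence.
        seq = torch.cat([txt_tokens, img_tokens], dim=1)
        batch, n_txt = token_ids.shape
        segments = torch.cat(
            [
                torch.zeros(batch, n_txt, dtype=torch.long, device=token_ids.device),
                torch.ones(batch, img_tokens.size(1), dtype=torch.long, device=token_ids.device),
            ],
            dim=1,
        )
        seq = seq + self.segment_embed(segments)

        encoded = self.encoder(seq)
        # The first position plays the role of a [CLS] token (assumption).
        return self.classifier(encoded[:, 0])


torch.manual_seed(0)
model = FusionSketch().eval()
images = torch.randn(2, 3, 64, 64)            # two dummy RGB images
token_ids = torch.randint(0, 1000, (2, 10))   # two dummy token sequences
with torch.no_grad():
    logits = model(images, token_ids)
    probs = torch.sigmoid(logits)             # independent per-label probabilities
print(tuple(logits.shape))
```

The design point the sketch is meant to show is that fusion happens before the encoder: once image regions are projected to the text embedding width, self-attention can relate every image token to every text token, rather than combining two separately computed predictions at the end.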

CLC number:

TP391.1

Jiaxin LI, Shuguang SU. Image and text multi-modality fusion classification model based on BERT[J]. Journal of Computer Applications, 2023, 43(S1): 39-44.

Item	Configuration
Operating system	Windows 11
CPU	Intel Xeon 2.20 GHz
GPU	Tesla K80
CUDA	11.4
Python	3.7
PyTorch	1.11.0

Class	Precision	Recall	F1	n_su (samples)
No Finding	0.94	0.93	0.94	151
Cardiomegaly	0.90	0.85	0.88	41
Pneumonia	0.50	0.25	0.33	4
Widened Mediastinum	0.91	0.71	0.80	14
Lung Opacity	0.88	0.94	0.91	53
Lung Lesion	0.69	0.82	0.75	11
Pleural Effusion	0.00	0.00	0.00	2
Support Devices	0.88	0.93	0.90	15
Fracture	0.57	0.80	0.67	5
Pleural Other	1.00	0.60	0.75	5
Consolidation	0.50	0.50	0.50	4
Atelectasis	0.75	1.00	0.86	12
Pneumothorax	0.00	0.00	0.00	0
Edema	0.00	0.00	0.00	1
Micro Avg	0.88	0.88	0.88	318
Macro Avg	0.61	0.60	0.59	318
Weighted Avg	0.88	0.88	0.88	318
Samples Avg	0.90	0.90	0.89	318
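
The summary rows of the table above can be reproduced from its per-class rows. The sketch below copies the F1 and sample-count columns and recomputes the macro and weighted averages; the function names are illustrative, and the assumption is that the table follows the usual convention in which the macro average is an unweighted mean over all 14 classes (including classes with no samples) and the weighted average uses the sample counts as weights.

```python
# (class, F1, sample count) copied from the per-class rows of the table.
rows = [
    ("No Finding", 0.94, 151),
    ("Cardiomegaly", 0.88, 41),
    ("Pneumonia", 0.33, 4),
    ("Widened Mediastinum", 0.80, 14),
    ("Lung Opacity", 0.91, 53),
    ("Lung Lesion", 0.75, 11),
    ("Pleural Effusion", 0.00, 2),
    ("Support Devices", 0.90, 15),
    ("Fracture", 0.67, 5),
    ("Pleural Other", 0.75, 5),
    ("Consolidation", 0.50, 4),
    ("Atelectasis", 0.86, 12),
    ("Pneumothorax", 0.00, 0),
    ("Edema", 0.00, 1),
]


def macro_average(rows):
    """Unweighted mean: every class counts equally, however rare."""
    return sum(f1 for _, f1, _ in rows) / len(rows)


def weighted_average(rows):
    """Mean weighted by the number of samples in each class."""
    total = sum(n for _, _, n in rows)
    return sum(f1 * n for _, f1, n in rows) / total


total_samples = sum(n for _, _, n in rows)
macro_f1 = macro_average(rows)
weighted_f1 = weighted_average(rows)
print(f"samples={total_samples} macro F1={macro_f1:.2f} weighted F1={weighted_f1:.2f}")
```

Under these assumptions the recomputed values agree with the Macro Avg and Weighted Avg rows, and they show why the macro average is much lower: the three classes with F1 of 0.00 together have only 3 samples, so they barely affect the weighted average but each pulls the macro average down by one fourteenth.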

Model	Macro-average F1	Micro-average F1	Training time/min
ITMMB	0.77887	0.90575	102
Image-only classification model	0.12468	0.51815	15
Text-only classification model	0.75831	0.85914	51
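
The percentage-point gains quoted in the abstract follow directly from the micro-average F1 column of this comparison table. A short check, with dictionary keys and the helper name chosen only for illustration:

```python
# Micro-average F1 values copied from the comparison table.
micro_f1 = {
    "ITMMB": 0.90575,
    "image_only": 0.51815,
    "text_only": 0.85914,
}


def gain_in_percentage_points(model_score, baseline_score):
    """Absolute difference between two scores on a 0-1 scale, expressed in points."""
    return round((model_score - baseline_score) * 100, 2)


gain_vs_image = gain_in_percentage_points(micro_f1["ITMMB"], micro_f1["image_only"])
gain_vs_text = gain_in_percentage_points(micro_f1["ITMMB"], micro_f1["text_only"])
print(gain_vs_image, gain_vs_text)
```

These are absolute differences in percentage points, not relative improvements: the gain over the image-only model is 38.76 points and the gain over the text-only model is 4.66 points, matching the figures in the abstract.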