Journal of Computer Applications, 2023, Vol. 43, Issue (10): 3099-3106. DOI: 10.11772/j.issn.1001-9081.2022101510
Special Issue: Artificial Intelligence
Yizhen BI, Huan MA, Changqing ZHANG
Received: 2022-10-11
Revised: 2023-01-24
Accepted: 2023-02-02
Online: 2023-04-12
Published: 2023-10-10
Contact: Changqing ZHANG
About author: BI Yizhen, born in 1998 in Weifang, Shandong, M.S. candidate. His research interests include multimodal learning and machine learning.
Yizhen BI, Huan MA, Changqing ZHANG. Dynamic evaluation method for benefit of modality augmentation[J]. Journal of Computer Applications, 2023, 43(10): 3099-3106.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2022101510
| Dataset | Modality 1 dimension | Modality 2 dimension | Number of classes |
|---|---|---|---|
| hand | 216 | 76 | 10 |
| CMU-MOSEI | 50×300 | 50×35 | 7 |
| Dermatology | 11 | 23 | 6 |
| TCGA | 64×64×3 | 80 | 3 |
Tab. 1 Description of datasets
| Dataset | Modality 1 accuracy/% | Modality 2 accuracy/% | After fusion/% |
|---|---|---|---|
| hand | 97.41±0.31 | 74.91±1.85 | 98.41±0.11 |
| CMU-MOSEI | 50.25±0.14 | 41.88±0.27 | 50.37±0.13 |
| Dermatology | 79.33±1.69 | 94.33±0.94 | 95.33±1.69 |
| TCGA | 47.73±2.68 | 61.87±0.62 | 62.74±1.08 |
Tab. 2 Accuracy comparison between unimodal and multimodal models
| Dataset | Modality 1 | Modality 2 | Test set size |
|---|---|---|---|
| hand | × | √ | 400 |
| CMU-MOSEI | × | √ | 4 643 |
| Dermatology | √ | × | 100 |
| TCGA | √ | × | 231 |
Tab. 3 Description of missing modalities
| Dataset | Average fusion/% | Weighted fusion/% |
|---|---|---|
| hand | 96.91±0.82 | 98.41±0.11 |
| CMU-MOSEI | 48.69±1.03 | 50.37±0.13 |
| Dermatology | 92.66±0.47 | 95.33±1.69 |
| TCGA | 59.71±2.31 | 62.74±1.08 |
Tab. 4 Accuracy comparison between weighted fusion and average fusion
| Dataset | Adaptive weight of modality 1 | Adaptive weight of modality 2 |
|---|---|---|
| hand | 0.6245 | 0.3755 |
| CMU-MOSEI | 0.8970 | 0.1030 |
| Dermatology | 0.4318 | 0.5682 |
| TCGA | 0.3900 | 0.6100 |
Tab. 5 Training result of α
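The weights in Tab. 5 sum to 1 for each dataset, which is consistent with adaptive weighted decision fusion of two unimodal predictions. The paper's exact fusion rule is not reproduced on this page; the following is a minimal sketch under that assumption, using the `hand` dataset weights from Tab. 5 and hypothetical class-probability vectors:

```python
import numpy as np

def average_fusion(p1, p2):
    # Average the two unimodal class-probability vectors (Tab. 4 baseline).
    return (p1 + p2) / 2.0

def weighted_fusion(p1, p2, alpha1, alpha2):
    # Convex combination with adaptive weights; alpha1 + alpha2 = 1,
    # so the fused vector remains a valid probability distribution.
    return alpha1 * p1 + alpha2 * p2

# Hypothetical unimodal predictions for a 3-class problem
p1 = np.array([0.6, 0.3, 0.1])
p2 = np.array([0.2, 0.5, 0.3])

# Adaptive weights from the "hand" row of Tab. 5
fused = weighted_fusion(p1, p2, 0.6245, 0.3755)
```

When the weights are both 0.5, `weighted_fusion` reduces to `average_fusion`; the gap between the two columns of Tab. 4 reflects how far the learned weights move away from that uniform setting.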