Multi-modal summarization model based on semantic relevance analysis

doi:10.11772/j.issn.1001-9081.2022101527

Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (1): 65-72.DOI: 10.11772/j.issn.1001-9081.2022101527

• Cross-media representation learning and cognitive reasoning •

Yuxiang LIN¹,², Yunbing WU¹,², Aiying YIN³, Xiangwen LIAO¹,²

¹ College of Computer and Data Science, Fuzhou University, Fuzhou, Fujian 350108, China
² Digital Fujian Institute of Financial Big Data, Fuzhou, Fujian 350108, China
³ Department of Computer Engineering, Zhicheng College of Fuzhou University, Fuzhou, Fujian 350002, China

Received: 2022-10-14 Revised: 2023-02-08 Accepted: 2023-02-14 Online: 2023-04-12 Published: 2024-01-10
Contact: Yunbing WU
About author: LIN Yuxiang, born in 1998, M. S. candidate. His research interests include multimodal summarization and natural language processing.
YIN Aiying, born in 1976, M. S., lecturer. Her research interests include data mining and text retrieval.
LIAO Xiangwen, born in 1980, Ph. D., professor. His research interests include opinion mining, sentiment analysis and natural language processing.
Supported by:
National Natural Science Foundation of China (61976054); Natural Science Foundation of Fujian Province (2022J01116)

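About the authors: LIN Yuxiang (born 1998), male, from Pingtan, Fujian, M. S. candidate; research interests: multimodal summarization, natural language processing.
YIN Aiying (born 1976), female, from Yuncheng, Shanxi, lecturer, M. S.; research interests: data mining, text retrieval.
LIAO Xiangwen (born 1980), male, from Quanzhou, Fujian, professor, doctoral supervisor, Ph. D.; research interests: opinion mining, sentiment analysis, natural language processing.
First contact: WU Yunbing (born 1976), male, from Pingtan, Fujian, associate professor, M. S.; research interests: knowledge representation and knowledge discovery, natural language processing.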

Abstract:

Multi-modal abstractive summarization is commonly built on the Sequence-to-Sequence (Seq2Seq) framework, whose objective function optimizes the model at the character level: words are generated from locally optimal results, and the global semantic information of the summary sample is ignored. This can cause semantic deviation between the summary and the multi-modal information, resulting in factual errors. To solve these problems, a multi-modal summarization model based on semantic relevance analysis was proposed. Firstly, a summary generator based on the Seq2Seq framework was trained to generate semantically diverse candidate summaries. Secondly, a summary evaluator based on semantic relevance analysis was built to learn, from a global perspective, the semantic differences among candidate summaries and the ranking pattern of the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metric, so that the model could be optimized at the level of summary samples. Finally, the summary evaluator was used to score the candidate summaries without reference summaries, so that the selected summary was as similar as possible to the source text in semantic space. Experiments on the public benchmark dataset MMSS show that, compared with the previously best-performing MPMSE (Multimodal Pointer-generator via Multimodal Selective Encoding) model, the proposed model improves ROUGE-1, ROUGE-2 and ROUGE-L by 3.17, 1.21 and 2.24 percentage points respectively.

Key words: multi-modal, abstractive summarization, Sequence-to-Sequence (Seq2Seq), factual error, semantic relevance
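
The evaluator described in the abstract can be illustrated with a minimal sketch. It assumes that the source text and each candidate summary have already been encoded as fixed-length vectors, that relevance is measured by cosine similarity, and that training uses a margin-based pairwise ranking loss of the kind used by contrastive rerankers such as SimCLS [13]. The abstract does not confirm any of these choices, and all function names and the `margin` parameter are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_summary(source_vec, candidate_vecs):
    """Reference-free selection: pick the candidate closest to the source.

    Returns (index of the best candidate, list of all scores).
    """
    scores = [cosine(source_vec, c) for c in candidate_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


def pairwise_ranking_loss(scores_by_rouge_rank, margin=0.01):
    """Hinge loss that pushes evaluator scores to follow the ROUGE ordering.

    scores_by_rouge_rank[0] is the evaluator score of the candidate with the
    highest ROUGE; a lower-ranked candidate is penalised whenever its score
    comes within (rank gap * margin) of a higher-ranked one (assumed form).
    """
    loss = 0.0
    n = len(scores_by_rouge_rank)
    for i in range(n):
        for j in range(i + 1, n):
            gap = (j - i) * margin
            loss += max(0.0, scores_by_rouge_rank[j] - scores_by_rouge_rank[i] + gap)
    return loss


# Toy vectors standing in for encoder outputs (illustrative values only).
source = [1.0, 0.0, 1.0]
candidates = [[0.0, 1.0, 0.0], [1.0, 0.1, 0.9], [1.0, 1.0, 0.0]]
best, scores = select_summary(source, candidates)
```

At inference time only `select_summary` is needed, which is what makes the selection reference-free; the ranking loss is used during training, when ROUGE scores against the reference summary are available to order the candidates.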

CLC Number: TP391.1

Yuxiang LIN, Yunbing WU, Aiying YIN, Xiangwen LIAO. Multi-modal summarization model based on semantic relevance analysis[J]. Journal of Computer Applications, 2024, 44(1): 65-72.

Figures/Tables 17

Fig. 1 Example of factual error of summary

Fig. 2 Flow of multi-modal summarization model based on semantic relevance analysis

Fig. 3 Multi-modal summarization model based on semantic relevance analysis

Fig. 4 Visual selective gated network

Fig. 5 Multi-modal selective gated network

Fig. 6 Principle of multimodal information fusion layer

Fig. 7 Flow of decoding

Fig. 8 Summary evaluator

Tab. 1 Information of MMSS dataset

Dataset split	Sentence-title pairs	Images
Training set	62,000	62,000
Validation set	2,000	2,000
Test set	2,000	2,000

Tab. 2 Experimental parameter settings of summary generation module

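Parameter	Value	Parameter	Value
Hidden state dimension	512	Initial learning rate	0.0005
Word embedding dimension	300	Learning rate decay rate	0.5
batch_size	8	Dropout	0.2
Beam width (beam search)	16	Gradient clipping	2.0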
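
As a rough illustration of how the settings in Tab. 2 might be used, the sketch below stores them in a dictionary and applies two of them. It assumes that the gradient clipping value 2.0 is a maximum global L2 norm and that the decay rate 0.5 multiplies the learning rate each time a decay is triggered; the table does not say which clipping variant or decay trigger is used, and the key and function names are hypothetical.

```python
import math

# Values copied from Tab. 2; the key names are illustrative.
GENERATOR_CONFIG = {
    "hidden_size": 512,
    "embedding_size": 300,
    "batch_size": 8,
    "beam_width": 16,
    "initial_lr": 0.0005,
    "lr_decay": 0.5,
    "dropout": 0.2,
    "grad_clip": 2.0,
}


def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so their joint L2 norm does not exceed max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm or total == 0.0:
        return list(grads)
    scale = max_norm / total
    return [g * scale for g in grads]


def decayed_lr(initial_lr, decay, num_decays):
    """Learning rate after num_decays multiplicative decay steps (assumed schedule)."""
    return initial_lr * decay ** num_decays


clipped = clip_by_global_norm([3.0, 4.0], GENERATOR_CONFIG["grad_clip"])
lr_after_two = decayed_lr(
    GENERATOR_CONFIG["initial_lr"], GENERATOR_CONFIG["lr_decay"], 2
)
```

Under these assumptions, a gradient of norm 5 is scaled down to norm 2, and two decay steps reduce the learning rate from 0.0005 to 0.000125.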

Tab. 3 Experimental parameter settings of summary evaluation module

Parameter	Value	Parameter	Value
batch_size	8	warmup steps	1,000
num_epoch	8	max_lr	0.002

Tab. 4 Experimental results on MMSS dataset

Model	ROUGE-1	ROUGE-2	ROUGE-L
Compress [24]	31.56	11.02	28.87
ABS [5]	35.95	18.21	31.89
SEASS [25]	44.86	23.03	41.92
PGNet [9]	46.05	24.18	44.16
MAtt [2]	45.78	23.45	43.16
MPID [23]	48.11	24.70	44.96
MPMSE [3]	48.19	25.64	45.27
Proposed model	51.36	26.85	47.51
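
The percentage-point gains quoted in the abstract can be checked directly against the last two rows of Tab. 4. The short script below copies those rows and subtracts them; the variable names are illustrative.

```python
# Scores copied from Tab. 4 (MMSS dataset).
baseline = {"ROUGE-1": 48.19, "ROUGE-2": 25.64, "ROUGE-L": 45.27}  # MPMSE [3]
proposed = {"ROUGE-1": 51.36, "ROUGE-2": 26.85, "ROUGE-L": 47.51}  # proposed model

# Absolute differences in percentage points, rounded to two decimals.
gains = {metric: round(proposed[metric] - baseline[metric], 2) for metric in baseline}
print(gains)  # {'ROUGE-1': 3.17, 'ROUGE-2': 1.21, 'ROUGE-L': 2.24}
```

The differences match the 3.17, 1.21 and 2.24 percentage points reported in the abstract.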

Tab. 5 Influence of removing different modules on experimental results

Model	ROUGE-1	ROUGE-2	ROUGE-L
Proposed model	51.36	26.85	47.51
w/o $sGate_i^{img}$	50.40	26.12	46.68
w/o $f_1(\cdot)$	48.79	25.94	45.72

Tab. 6 Experimental results of different summary evaluators

Model	ROUGE-1	ROUGE-2	ROUGE-L
Without summary evaluator	48.79	25.94	45.72
Summary evaluator $f_1(\cdot)$	51.36	26.85	47.51
Summary evaluator $f_2(\cdot)$	49.86	26.14	46.54

Tab. 7 Influence of removing visual global information of different modules on experimental results

Model	ROUGE-1	ROUGE-2	ROUGE-L
Proposed model	51.36	26.85	47.51
$g$ w/o $sGate_i^{img}$	50.40	26.12	46.68
$g$ w/o $sGate_i^{mm}$	50.75	26.16	46.76
$g$ w/o LSTM	50.46	25.72	46.32

Fig. 9 Influence of number of candidate summaries on experimental results

Fig. 10 Comparison of original and generated summaries

References 25

1	SOLEYMANI M, GARCIA D, JOU B, et al. A survey of multimodal sentiment analysis [J]. Image and Vision Computing, 2017, 65(9): 3-14. DOI: 10.1016/j.imavis.2017.08.003
2	LI H, ZHU J, LIU T, et al. Multi-modal sentence summarization with modality attention and image filtering [C]// Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2018: 4152-4158. DOI: 10.24963/ijcai.2018/577
3	LI H, ZHU J, ZHANG J, et al. Multimodal sentence summarization via multimodal selective encoding [C]// Proceedings of the 28th International Conference on Computational Linguistics. [S.l.]: International Committee on Computational Linguistics, 2020: 5655-5667. DOI: 10.18653/v1/2020.coling-main.496
4	MIHALCEA R, TARAU P. TextRank: Bringing order into text [C]// Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics, 2004: 404-411. DOI: 10.3115/1220355.1220517
5	RUSH A M, CHOPRA S, WESTON J. A neural attention model for abstractive sentence summarization [C]// Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics, 2015: 379-389. DOI: 10.18653/v1/d15-1044
6	CHOPRA S, AULI M, RUSH A M. Abstractive sentence summarization with attentive recurrent neural networks [C]// Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA: Association for Computational Linguistics, 2016: 93-98. DOI: 10.18653/v1/n16-1012
7	NALLAPATI R, ZHOU B, SANTOS C D, et al. Abstractive text summarization using sequence-to-sequence RNNs and beyond [C]// Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning. Stroudsburg, PA: Association for Computational Linguistics, 2016: 280-290. DOI: 10.18653/v1/k16-1028
8	GU J, LU Z, LI H, et al. Incorporating copying mechanism in sequence-to-sequence learning [C]// Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, 2016: 1631-1640. DOI: 10.18653/v1/p16-1154
9	SEE A, LIU P J, MANNING C D. Get to the point: Summarization with pointer-generator networks [C]// Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, PA: Association for Computational Linguistics, 2017: 1073-1083. DOI: 10.18653/v1/p17-1099
10	ZHU J, LI H, LIU T, et al. MSMO: Multimodal summarization with multimodal output [C]// Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics, 2018: 4154-4164. DOI: 10.18653/v1/d18-1448
11	YE X, YUE Z, LIU R. MBA: A multimodal bilinear attention model with residual connection for abstractive multimodal summarization [J]. Journal of Physics: Conference Series, 2021, 1856: 012070. DOI: 10.1088/1742-6596/1856/1/012070
12	ZHANG Z, WANG J, SUN Z, et al. LAMS: A location-aware approach for multimodal summarization [J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(18): 15949-15950. DOI: 10.1609/aaai.v35i18.17971
13	LIU Y, LIU P. SimCLS: A simple framework for contrastive learning of abstractive summarization [C]// Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Stroudsburg, PA: Association for Computational Linguistics, 2021: 1065-1072. DOI: 10.18653/v1/2021.acl-short.135
14	PAULUS R, XIONG C, SOCHER R. A deep reinforced model for abstractive summarization [EB/OL]. [2022-10-01].
15	LI S, LEI D, QIN P, et al. Deep reinforcement learning with distributional semantic rewards for abstractive summarization [C]// Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics, 2019: 6038-6044. DOI: 10.18653/v1/d19-1623
16	SHEN W, GONG Y, SHEN Y, et al. Joint generator-ranker learning for natural language generation [EB/OL]. (2022-10-19) [2023-02-06]. DOI: 10.18653/v1/2023.findings-acl.486
17	PAN H, LIN Z, FU P, et al. Modeling intra and inter-modality incongruity for multi-modal sarcasm detection [C]// Findings of the Association for Computational Linguistics: EMNLP 2020. Stroudsburg, PA: Association for Computational Linguistics, 2020: 1383-1392. DOI: 10.18653/v1/2020.findings-emnlp.124
18	SCHUSTER M, PALIWAL K K. Bidirectional recurrent neural networks [J]. IEEE Transactions on Signal Processing, 1997, 45(11): 2673-2681. DOI: 10.1109/78.650093
19	BAHDANAU D, CHO K, BENGIO Y. Neural machine translation by jointly learning to align and translate [EB/OL]. [2022-10-01]. DOI: 10.1017/9781108608480.003
20	HOCHREITER S, SCHMIDHUBER J. Long short-term memory [J]. Neural Computation, 1997, 9(8): 1735-1780. DOI: 10.1162/neco.1997.9.8.1735
21	LIU Y, OTT M, GOYAL N, et al. RoBERTa: a robustly optimized BERT pretraining approach [EB/OL]. (2019-06-26) [2023-02-06].
22	CAI Z X, SUN J W. News text summarization model integrating pointer network [J]. Journal of Chinese Computer Systems, 2021, 42(3): 462-466. (in Chinese) DOI: 10.3969/j.issn.1000-1220.2021.03.003
23	LI H, YUAN P, XU S, et al. Aspect-aware multimodal summarization for Chinese e-commerce products [J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(5): 8188-8195. DOI: 10.1609/aaai.v34i05.6332
24	CLARKE J, LAPATA M. Global inference for sentence compression: An integer linear programming approach [J]. Journal of Artificial Intelligence Research, 2008, 31(1): 399-429. DOI: 10.1613/jair.2433
25	ZHOU Q, YANG N, WEI F, et al. Selective encoding for abstractive sentence summarization [C]// Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, 2017: 1095-1104. DOI: 10.18653/v1/p17-1101
