Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (1): 73-78. DOI: 10.11772/j.issn.1001-9081.2022121910

• Cross-media representation learning and cognitive reasoning •

Product summarization extraction model with multimodal information fusion

Qiang ZHAO, Zhongqing WANG, Hongling WANG

  1. School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
  • Received: 2023-01-09 Revised: 2023-04-25 Accepted: 2023-04-25 Online: 2024-01-24 Published: 2024-01-10
  • Contact: Hongling WANG
  • About author: ZHAO Qiang, born in 1996, M. S. candidate. His research interests include multimodal summarization.
    WANG Zhongqing, born in 1987, Ph. D., associate professor. His research interests include sentiment analysis.
    WANG Hongling (corresponding author), born in 1975, Ph. D., associate professor. Her research interests include text summarization.
  • Supported by:
    National Natural Science Foundation of China(61976146)

Abstract:

On online shopping platforms, concise, authentic and effective product summaries are crucial to improving the shopping experience. Moreover, online shoppers cannot touch the actual product, so the information contained in the product image is important visual information beyond the product text description; a product summary that fuses multimodal information, including the product text and product image, is therefore of great significance for online shopping. To fuse product text descriptions with product images, a product summarization extraction model with multimodal information fusion was proposed. Unlike the general product summarization task, whose input contains only the product text description, the proposed model introduces the product image as an additional source of information to make the extracted summary richer. Specifically, pre-trained models were first used to represent the features of the product text description and the product image: the text feature representation of each sentence was extracted from the product text description, and the overall visual feature representation of the product was extracted from the product image. Then, a low-rank tensor-based multimodal fusion method was used to fuse the text features with the overall visual features, yielding a multimodal feature representation for each sentence. Finally, the multimodal feature representations of all sentences were fed into the summary generator to produce the final product summary. Comparative experiments were conducted on the CEPSUM 2.0 (Chinese E-commerce Product SUMmarization 2.0) dataset. On the three subsets of CEPSUM 2.0, the average ROUGE-1 (Recall-Oriented Understudy for Gisting Evaluation 1) of the proposed model is 3.12 percentage points higher than that of TextRank and 1.75 percentage points higher than that of BERTSUMExt (BERT SUMmarization Extractive).
Experimental results show that the proposed model effectively fuses product text and image information and performs well on the ROUGE evaluation metrics.
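The per-sentence fusion step described above can be sketched as follows. This is a minimal illustration of low-rank tensor-based multimodal fusion, in which the bilinear fusion tensor is factorized into per-modality low-rank factors; the feature dimensions, rank, and class name here are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

class LowRankFusion(nn.Module):
    """Sketch of low-rank tensor fusion of a sentence's text feature
    with the product's overall visual feature (dimensions and rank
    are illustrative assumptions)."""
    def __init__(self, text_dim=768, visual_dim=2048, out_dim=256, rank=4):
        super().__init__()
        # One set of low-rank factors per modality; the "+1" slot
        # (a constant appended to each input) lets the fused result
        # retain unimodal terms, as in low-rank fusion.
        self.text_factor = nn.Parameter(torch.randn(rank, text_dim + 1, out_dim) * 0.02)
        self.visual_factor = nn.Parameter(torch.randn(rank, visual_dim + 1, out_dim) * 0.02)

    def forward(self, text_feat, visual_feat):
        ones = text_feat.new_ones(text_feat.size(0), 1)
        t = torch.cat([text_feat, ones], dim=1)    # (B, text_dim + 1)
        v = torch.cat([visual_feat, ones], dim=1)  # (B, visual_dim + 1)
        # Project each modality with its rank factors, then fuse by
        # elementwise product and sum over the rank dimension.
        t_proj = torch.einsum('bd,rdo->rbo', t, self.text_factor)
        v_proj = torch.einsum('bd,rdo->rbo', v, self.visual_factor)
        return (t_proj * v_proj).sum(dim=0)        # (B, out_dim)

fusion = LowRankFusion()
text = torch.randn(3, 768)    # text features for 3 sentences
image = torch.randn(3, 2048)  # product visual feature, repeated per sentence
fused = fusion(text, image)
print(fused.shape)  # torch.Size([3, 256])
```

The resulting per-sentence multimodal representations would then be scored by a summary generator to select the extracted sentences.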

Key words: product summarization, multimodal summarization, extractive summarization, multimodal fusion, automatic summarization
