《计算机应用》 (Journal of Computer Applications) ›› 2023, Vol. 43 ›› Issue (4): 991-1004. DOI: 10.11772/j.issn.1001-9081.2022020296

• 人工智能 •

多模态预训练模型综述

王惠茹1, 李秀红1(), 李哲2, 马春明1, 任泽裕1, 杨丹1   

  1. 新疆大学 信息科学与工程学院,乌鲁木齐 830046
    2.香港理工大学 电子及资讯工程学系,香港 999077
  • 收稿日期:2022-03-16 修回日期:2022-06-06 接受日期:2022-06-07 发布日期:2022-08-16 出版日期:2023-04-10
  • 通讯作者: 李秀红
  • 作者简介:王惠茹(1996—),女,新疆伊犁人,硕士研究生,主要研究方向:自然语言处理、图像处理;
    李哲(1992—),男,山东泰安人,博士研究生,主要研究方向:多模态说话人识别、鲁棒性机器学习;
    马春明(1997—),男,四川绵阳人,硕士研究生,主要研究方向:自然语言处理、事件抽取;
    任泽裕(1998—),男,山西长治人,硕士研究生,主要研究方向:语音识别、图像处理;
    杨丹(1996—),女,四川南充人,硕士研究生,主要研究方向:自然语言处理、图像处理。
  • 基金资助:
    国家语委重点研发项目(ZDI135-96)

Survey of multimodal pre-training models

Huiru WANG1, Xiuhong LI1(), Zhe LI2, Chunming MA1, Zeyu REN1, Dan YANG1   

  1. School of Information Science and Engineering,Xinjiang University,Urumqi Xinjiang 830046,China
    2.Department of Electronic and Information Engineering,The Hong Kong Polytechnic University,Hong Kong 999077,China
  • Received:2022-03-16 Revised:2022-06-06 Accepted:2022-06-07 Online:2022-08-16 Published:2023-04-10
  • Contact: Xiuhong LI
  • About author:WANG Huiru, born in 1996, M. S. candidate. Her research interests include natural language processing, image processing.
    LI Zhe, born in 1992, Ph. D. candidate. His research interests include multimodal speaker recognition, robust machine learning.
    MA Chunming, born in 1997, M. S. candidate. His research interests include natural language processing, event extraction.
    REN Zeyu, born in 1998, M. S. candidate. His research interests include speech recognition, image processing.
    YANG Dan, born in 1996, M. S. candidate. Her research interests include natural language processing, image processing.
  • Supported by:
    National Language Commission Key Project(ZDI135-96)

摘要:

预训练模型(PTM)通过利用复杂的预训练目标和大量的模型参数,可以有效地获得无标记数据中的丰富知识。而在多模态中,PTM的发展还处于初期。根据具体模态的不同,将目前大多数的多模态PTM分为图像-文本PTM和视频-文本PTM;根据数据融合方式的不同,还可将多模态PTM分为单流模型和双流模型两类。首先,总结了常见的预训练任务和验证实验所使用的下游任务;接着,梳理了目前多模态预训练领域的常见模型,并用表格列出各个模型的下游任务以及模型的性能和实验数据比较;然后,介绍了M6(Multi-Modality to Multi-Modality Multitask Mega-transformer)模型、跨模态提示调优(CPT)模型、VideoBERT(Video Bidirectional Encoder Representations from Transformers)模型和AliceMind(Alibaba’s collection of encoder-decoders from Mind)模型在具体下游任务中的应用场景;最后,总结了多模态PTM相关工作面临的挑战以及未来可能的研究方向。

关键词: 多模态, 预训练模型, 图像-文本预训练模型, 视频-文本预训练模型, 神经网络, 单流模型, 双流模型

Abstract:

By using complex pre-training objectives and a large number of model parameters, a Pre-Training Model (PTM) can effectively obtain rich knowledge from unlabeled data. However, the development of multimodal PTMs is still in its infancy. According to the specific modalities involved, most current multimodal PTMs were divided into image-text PTMs and video-text PTMs; according to the data fusion method, multimodal PTMs were further divided into two types: single-stream models and two-stream models. Firstly, the common pre-training tasks and the downstream tasks used in validation experiments were summarized. Secondly, the common models in the area of multimodal pre-training were sorted out, and the downstream tasks, performance and experimental data of these models were listed in tables for comparison. Thirdly, the application scenarios of the M6 (Multi-Modality to Multi-Modality Multitask Mega-transformer) model, the Cross-modal Prompt Tuning (CPT) model, the VideoBERT (Video Bidirectional Encoder Representations from Transformers) model and the AliceMind (Alibaba’s collection of encoder-decoders from Mind) model in specific downstream tasks were introduced. Finally, the challenges faced by existing multimodal PTM work and possible future research directions were summarized.
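
Note: the single-stream versus two-stream distinction in the abstract concerns how visual and textual features are fused. The sketch below is purely illustrative and is not taken from the paper; it assumes PyTorch is available, and the class names (SingleStreamFusion, TwoStreamFusion), the toy dimensions and the single cross-attention step are hypothetical simplifications of the two fusion schemes.

import torch
import torch.nn as nn

class SingleStreamFusion(nn.Module):
    # Single-stream fusion: text and image embeddings are concatenated into one
    # sequence and encoded jointly by a single Transformer encoder.
    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, text_emb, image_emb):
        # (batch, text_len + image_len, dim): every layer sees both modalities
        return self.encoder(torch.cat([text_emb, image_emb], dim=1))

class TwoStreamFusion(nn.Module):
    # Two-stream fusion: each modality is encoded by its own Transformer first,
    # then the two streams interact, here through one cross-attention step.
    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
            num_layers=layers)
        self.image_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
            num_layers=layers)
        self.cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=heads, batch_first=True)

    def forward(self, text_emb, image_emb):
        t = self.text_encoder(text_emb)    # (batch, text_len, dim)
        v = self.image_encoder(image_emb)  # (batch, image_len, dim)
        fused, _ = self.cross_attn(query=t, key=v, value=v)  # text tokens attend to image tokens
        return fused                       # (batch, text_len, dim)

if __name__ == "__main__":
    text = torch.randn(2, 16, 256)   # e.g. 16 word-piece embeddings per sample
    image = torch.randn(2, 36, 256)  # e.g. 36 image-region features per sample
    print(SingleStreamFusion()(text, image).shape)  # torch.Size([2, 52, 256])
    print(TwoStreamFusion()(text, image).shape)     # torch.Size([2, 16, 256])

In practice, the two-stream models covered by the survey typically stack several such cross-modal interaction layers, whereas single-stream models rely on self-attention over the concatenated sequence; the sketch only shows the minimal structural difference between the two designs.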

Key words: multimodal, Pre-Training Model (PTM), image-text pre-training model, video-text pre-training model, neural network, single-stream model, two-stream model

中图分类号: