Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (2): 411-420. DOI: 10.11772/j.issn.1001-9081.2024010130

• Artificial Intelligence •

Tri-modal adapter based on selective state space

Hongye LIU, Xiai CHEN, Tao ZENG

  1. College of Mechanical and Electrical Engineering, China Jiliang University, Hangzhou, Zhejiang 310018, China
  • Received: 2024-02-05 Revised: 2024-04-15 Accepted: 2024-04-15 Online: 2024-05-09 Published: 2025-02-10
  • Contact: Xiai CHEN
  • About author: LIU Hongye, born in 1998 in Hangzhou, Zhejiang, M.S. candidate. His research interests include text detection, text recognition, and multimodal content understanding.
    ZENG Tao, born in 1989 in Lishui, Zhejiang, Ph.D., lecturer. His research interests include machine learning and deep learning.
  • Supported by:
    National Natural Science Foundation of China (52005472)

Abstract:

The pre-training-then-fine-tuning paradigm is widely used in a variety of unimodal and multimodal tasks. However, as model sizes grow exponentially, fine-tuning all the parameters of a pre-trained model becomes prohibitively difficult. To address this problem, a tri-modal adapter based on a selective state space was designed: it freezes the pre-trained model, fine-tunes only a small number of additional parameters, and accomplishes dense interactions among the three modalities. Specifically, a long-term semantic selection module based on a selective state space and a short-term semantic interaction module centered on the visual or audio modality were proposed; these two modules were inserted in turn between the sequential encoders to realize dense interaction of tri-modal information. The long-term semantic selection module suppresses redundant information across the three modalities, while the short-term semantic interaction module models the interactions of local modality features over short time spans. Compared with previous methods that require pre-training on large-scale tri-modal datasets, the proposed method is more flexible and can inherit arbitrarily powerful unimodal or bimodal models. On the Music-AVQA tri-modal evaluation dataset, the proposed method achieves an average accuracy of 80.19%, an improvement of 4.09 percentage points over LAVISH.
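
The following is a minimal PyTorch sketch of the parameter-efficient scheme the abstract describes: a frozen pre-trained encoder with small trainable modules inserted between its layers. It is not the authors' implementation. The names (SelectiveSSMAdapter, forward_with_adapters, d_state) are hypothetical, the backbone is a generic single-modality transformer rather than a tri-modal one, and the recurrence is a simplified input-dependent (S6-style) scan standing in for the full selective-state-space block.

```python
import torch
import torch.nn as nn

class SelectiveSSMAdapter(nn.Module):
    """Simplified selective state-space recurrence: the decay and input maps
    depend on the current token, so the hidden state can selectively retain
    or discard information (the 'long-term semantic selection' idea)."""
    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.in_proj = nn.Linear(d_model, d_state)    # token -> state input (B_t x_t)
        self.gate_proj = nn.Linear(d_model, d_state)  # token -> input-dependent decay (A_t)
        self.out_proj = nn.Linear(d_state, d_model)   # state -> residual update

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); explicit sequential scan for clarity
        b, t, _ = x.shape
        h = x.new_zeros(b, self.in_proj.out_features)
        outs = []
        for i in range(t):
            a = torch.sigmoid(self.gate_proj(x[:, i]))  # per-token forget gate
            h = a * h + self.in_proj(x[:, i])           # selective state update
            outs.append(self.out_proj(h))
        return x + torch.stack(outs, dim=1)             # residual adapter output

# Freeze a pretrained backbone; only the adapters will be fine-tuned.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=4)
for p in encoder.parameters():
    p.requires_grad = False                             # backbone stays frozen

adapters = nn.ModuleList(SelectiveSSMAdapter(256) for _ in encoder.layers)

def forward_with_adapters(tokens: torch.Tensor) -> torch.Tensor:
    # Interleave frozen layers with trainable adapters, mirroring the
    # abstract's insertion of modules between the sequential encoders.
    h = tokens
    for layer, adapter in zip(encoder.layers, adapters):
        h = adapter(layer(h))
    return h

x = torch.randn(2, 10, 256)                             # (batch, seq, dim)
print(forward_with_adapters(x).shape)                   # torch.Size([2, 10, 256])
```

Because only the adapters receive gradients, an optimizer built over the trainable parameters alone (e.g., filter(lambda p: p.requires_grad, ...)) updates a small fraction of the model, which is what allows the frozen backbone to be inherited from any strong pre-trained unimodal or bimodal model.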

Key words: pre-training-then-fine-tuning, selective state space, tri-modal, long-term semantics, short-term semantics

CLC number: