Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (4): 1269-1276. DOI: 10.11772/j.issn.1001-9081.2023040540

• Multimedia Computing and Computer Simulation •


Image aesthetic quality evaluation method based on self-supervised vision Transformer

Rong HUANG1,2, Junjie SONG1, Shubo ZHOU1,2(), Hao LIU1,2   

  1. College of Information Science and Technology, Donghua University, Shanghai 201620, China
    2. Engineering Research Center of Digitalized Textile & Fashion Technology, Ministry of Education (Donghua University), Shanghai 201620, China
  • Received: 2023-05-08 Revised: 2023-06-29 Accepted: 2023-07-13 Online: 2023-12-04 Published: 2024-04-10
  • Contact: Shubo ZHOU (zhoushubo@dhu.edu.cn)
  • About author: HUANG Rong, born in 1985 in Shaoxing, Zhejiang, Ph. D., associate professor. His research interests include deep learning and image analysis.
    SONG Junjie, born in 1998 in Zibo, Shandong, M. S. candidate. His research interests include deep learning and image analysis.
    ZHOU Shubo, born in 1988 in Shaoxing, Zhejiang, Ph. D., lecturer. His research interests include deep learning and image analysis.
    LIU Hao, born in 1977 in Dazhou, Sichuan, Ph. D., associate professor, CCF member. His research interests include deep learning and machine vision.
  • Supported by:
    National Natural Science Foundation of China (62001099); Fundamental Research Funds for the Central Universities (2232023D-30)


Abstract:

Existing image aesthetic quality evaluation methods widely use Convolutional Neural Networks (CNNs) to extract image features. Limited by the local receptive field mechanism, CNNs struggle to extract global features from a given image, resulting in the absence of aesthetic attributes such as global composition relations and global color matching. To solve this problem, an image aesthetic quality evaluation method based on a Self-Supervised Vision Transformer (SSViT) model was proposed. The self-attention mechanism was utilized to establish long-distance dependencies among local patches of an image and to adaptively learn the correlations between different patches, thereby extracting global features that characterize the image's aesthetic attributes. Meanwhile, three aesthetic quality perception tasks, namely image degradation classification, image aesthetic quality ranking, and image semantic reconstruction, were designed to pre-train the Vision Transformer (ViT) on unlabeled image data in a self-supervised manner, so as to enhance the representation ability of the global features. Experimental results on the AVA (Aesthetic Visual Assessment) dataset show that the SSViT model achieves 83.28%, 0.763 4 and 0.746 2 in terms of aesthetic quality classification accuracy, Pearson Linear Correlation Coefficient (PLCC) and Spearman Rank-order Correlation Coefficient (SRCC), respectively. These results demonstrate the high accuracy of the SSViT model in image aesthetic quality evaluation.
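As a hedged illustration of the two correlation metrics reported above, the sketch below computes PLCC and SRCC between predicted and ground-truth aesthetic scores. The score arrays are hypothetical examples, not data from the AVA dataset, and the simple rank-based SRCC form assumes no tied scores:

```python
import numpy as np

def plcc(x, y):
    """Pearson Linear Correlation Coefficient between two score arrays."""
    return float(np.corrcoef(x, y)[0, 1])

def srcc(x, y):
    """Spearman Rank-order Correlation Coefficient: Pearson correlation
    of the ranks (this simple form assumes no tied scores)."""
    rx = np.argsort(np.argsort(x))  # rank of each element
    ry = np.argsort(np.argsort(y))
    return plcc(rx, ry)

# Hypothetical mean aesthetic scores for six images (ground truth vs. model).
y_true = np.array([5.2, 6.8, 4.1, 7.5, 5.9, 3.3])
y_pred = np.array([5.0, 6.5, 4.4, 7.1, 6.2, 3.6])

print(f"PLCC = {plcc(y_true, y_pred):.4f}")
print(f"SRCC = {srcc(y_true, y_pred):.4f}")
```

PLCC measures how linearly the predicted scores track the ground truth, while SRCC only depends on the ordering of the images, which is why both are commonly reported together in aesthetic quality evaluation.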

Key words: image aesthetic quality evaluation, Vision Transformer (ViT), self-supervised learning, global feature, self-attention mechanism
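The self-attention mechanism that the abstract credits with modeling long-distance dependencies between image patches can be sketched minimally as follows. The patch count, embedding dimension, random projections, and single attention head are illustrative assumptions; a real ViT uses learned multi-head projections, positional encodings, and stacked Transformer blocks:

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, dim = 16, 32  # e.g. a 4x4 grid of patch embeddings

X = rng.normal(size=(num_patches, dim))  # patch embeddings
Wq, Wk, Wv = (rng.normal(size=(dim, dim)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(dim)  # pairwise patch affinities

# Softmax over each row: every patch attends to every other patch,
# so global relations (composition, color) can influence each feature.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

out = weights @ V  # globally mixed patch features, shape (16, 32)
```

Because each output row is a weighted mixture over all patches, the receptive field is global from the first layer, in contrast with the local receptive field of a convolution.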

CLC number: