Quality Assessment of Short Videos Based on the CLIP Model

doi:10.11772/j.issn.1001-9081.2025020201

Abstract

To address the difficulty of effectively assessing the quality of short videos, which have rich content and complex structure, a method based on the Contrastive Language-Image Pre-Training (CLIP) model is proposed. First, an efficient structural feature extraction module is designed around the special form of short videos to capture their textual and layout characteristics. Next, to enhance the representation of global features, a multi-feature extractor is developed to capture quality features along three dimensions (spatiotemporal quality, structural quality, and perceptual quality), ensuring comprehensive coverage of semantic information and distortion characteristics. Finally, a text input template is constructed so that the CLIP features extracted from video frames guide the quality feature fusion process. Experimental results on four benchmark datasets show that the proposed method achieves higher accuracy and stability. Specifically, on the KVQ short video dataset, the Pearson Linear Correlation Coefficient (PLCC) and the Spearman Rank-Order Correlation Coefficient (SRCC) reach 0.922 and 0.919, respectively. On the TaoLive livestream dataset, the two metrics improve by an average of 1.1% over the second-best method. In terms of generalization, cross-dataset performance improves by an average of 1.6%, making the method suitable for a wide range of application scenarios.
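For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how PLCC and SRCC compare predicted quality scores against subjective scores. The function names and the sample scores are illustrative assumptions and are not taken from the paper. Some evaluation protocols fit a nonlinear mapping to the predictions before computing PLCC; that step is omitted here, and the abstract does not state whether the authors apply it.

```python
import math
from typing import List, Sequence


def plcc(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson linear correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def srcc(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return plcc(average_ranks(x), average_ranks(y))


# Hypothetical scores for five videos (not data from the paper).
predicted = [1.2, 2.9, 3.1, 4.8, 4.0]
subjective = [1.0, 3.0, 2.5, 5.0, 4.5]
print(f"PLCC = {plcc(predicted, subjective):.3f}")
print(f"SRCC = {srcc(predicted, subjective):.3f}")
```

PLCC measures how linearly the predictions track the subjective scores, while SRCC depends only on their ordering, which is why quality assessment results are usually reported with both.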

Key words: short video, video quality assessment, structural feature, Contrastive Language-Image Pre-Training (CLIP) model, human visual system


CLC Number: TP391

CHENG Shuaibo (程帅博), YAN Jia (颜佳). Quality assessment of short videos based on the CLIP model [J]. Journal of Computer Applications, DOI: 10.11772/j.issn.1001-9081.2025020201.

