Journal of Computer Applications, 2025, 45(3): 794-800. DOI: 10.11772/j.issn.1001-9081.2024091251

• Frontier research and typical applications of large models •

Synaesthesia metaphor analysis based on large language model and data augmentation

Kun SHENG, Zhongqing WANG

  1. School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
  • Received: 2024-09-05 Revised: 2024-10-30 Accepted: 2024-10-31 Online: 2024-11-13 Published: 2025-03-10
  • Contact: Zhongqing WANG
  • About author: SHENG Kun, born in 2000 in Yangzhou, Jiangsu, M. S. candidate, CCF member. His research interests include natural language processing and synaesthesia metaphor.
  • Supported by:
    National Natural Science Foundation of China(62076175)


Abstract:

Chinese synaesthesia metaphor analysis is a specific subtask in the metaphor domain. The uneven distribution of sensory words in synaesthesia corpora leads to data sparsity in Chinese synaesthesia metaphor datasets. To address this issue, sparse sensory-word data from the real training set were used as prompts, and additional synthetic samples were generated by a Large Language Model (LLM) for data augmentation. To prevent the noise introduced by synthetic data from degrading model performance, an LLM-based data augmentation framework was constructed, in which a scoring mechanism and a label error optimization mechanism were applied to reduce the distribution differences between synthetic and real data. Experimental results show that the proposed framework can generate high-quality synthetic data to expand the dataset, achieving an overall F1 score of 68.5% on the sensory word extraction and sensory domain classification tasks, an improvement of 2.7 percentage points over the baseline model T5 (Text-To-Text Transfer Transformer) trained only on real training data.
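
The abstract describes the pipeline only at a high level. The minimal Python sketch below illustrates the general idea of prompting an LLM with sparse sensory-word examples and filtering the generated sentences with a scoring step before they are added to the training data; it is not the authors' implementation, and all names (`Sample`, `build_prompt`, `call_llm`, `score_sample`, the threshold value) are hypothetical placeholders.

```python
# Minimal sketch (not the paper's implementation) of LLM-based data augmentation
# for sparse sensory words: real examples become prompts, an LLM generates
# synthetic synaesthesia sentences, and a scoring filter keeps only samples
# that look close enough to the real data distribution.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Sample:
    sentence: str        # sentence containing a synaesthesia metaphor
    sensory_word: str    # annotated sensory word, e.g. "sweet"
    source_domain: str   # sensory domain of the word, e.g. "taste"
    target_domain: str   # domain the word is mapped onto, e.g. "hearing"

def build_prompt(seed: Sample, n: int) -> str:
    """Turn one sparse real example into a generation prompt."""
    return (
        f"The sensory word '{seed.sensory_word}' ({seed.source_domain}) is rare in the corpus.\n"
        f"Example: {seed.sentence}\n"
        f"Write {n} new sentences that use '{seed.sensory_word}' as a synaesthesia "
        f"metaphor mapping {seed.source_domain} onto {seed.target_domain}."
    )

def augment(seeds: List[Sample],
            call_llm: Callable[[str], List[str]],   # hypothetical LLM wrapper
            score_sample: Callable[[str], float],   # hypothetical quality scorer
            per_seed: int = 5,
            threshold: float = 0.7) -> List[Sample]:
    """Generate and filter synthetic samples for sparse sensory words."""
    synthetic: List[Sample] = []
    for seed in seeds:
        for sentence in call_llm(build_prompt(seed, per_seed)):
            # Scoring mechanism: discard low-quality or off-distribution samples.
            if score_sample(sentence) >= threshold:
                synthetic.append(Sample(sentence, seed.sensory_word,
                                        seed.source_domain, seed.target_domain))
    return synthetic
```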

Key words: large language model, data augmentation, synaesthesia metaphor, data sparsity, data synthesis

