Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (3): 815-822. DOI: 10.11772/j.issn.1001-9081.2024010013

• Frontier research and typical applications of large models •

Speaker-emotion voice conversion method with limited corpus based on large language model and pre-trained model

Chaofeng LU1, Ye TAO1, Lianqing WEN1, Fei MENG2, Xiugong QIN3, Yongjie DU4, Yunlong TIAN4

  1. School of Information Science and Technology, Qingdao University of Science and Technology, Qingdao Shandong 266061, China
  2. School of Computer Science and Engineering, Linyi University, Linyi Shandong 276000, China
  3. Beijing Research Institute of Automation for Machinery Industry Company Limited, Beijing 100120, China
  4. Digital Home Network National Engineering Laboratory, Qingdao Shandong 266000, China
  • Received: 2024-01-11 Revised: 2024-03-22 Accepted: 2024-03-22 Online: 2024-05-09 Published: 2025-03-10
  • Contact: Ye TAO
  • About the authors: LU Chaofeng, born in 1999 in Heze, Shandong, M. S. His research interests include speech synthesis and natural language processing.
    WEN Lianqing, born in 1999 in Harbin, Heilongjiang, M. S. candidate. His research interests include speech synthesis and signal processing.
    MENG Fei, born in 2001 in Heze, Shandong. Her research interests include machine learning, data mining, and speech synthesis.
    QIN Xiugong, born in 1990 in Tengzhou, Shandong, Ph. D., senior engineer. His research interests include human-computer interaction and service robotics.
    DU Yongjie, born in 1984 in Shanghai, senior engineer. His research interests include image recognition, speech recognition, and Internet of Things communication.
    TIAN Yunlong, born in 1981 in Qingdao, Shandong, M. S., senior engineer. His research interests include digital transformation of the smart-home brain and scenario-based applications of new-generation artificial intelligence.
  • Supported by:
    National Key Research and Development Program of China (2023YFF0612100); Key Technology Research and Industrialization Demonstration Project in Qingdao (24-1-2-qljh-19-gx)

Abstract:

Few studies combine speaker conversion with emotional voice conversion, and the emotional corpus available for a target speaker in real-world scenarios is usually too small to train a model with strong generalization from scratch. To address these problems, a Speaker-Emotion Voice Conversion method with Limited corpus (LSEVC) was proposed that fuses a large language model with a pre-trained emotional speech synthesis model. Firstly, a large language model was used to generate text with the required emotion tags. Secondly, the pre-trained emotional speech synthesis model was fine-tuned on the target speaker corpus so that it embeds the target speaker. Thirdly, emotional speech was synthesized from the generated text for data augmentation. Fourthly, the synthesized speech and the original target speech were used together to train the speaker-emotion voice conversion model. Finally, to further enhance the speaker similarity and emotional similarity of the converted speech, the model was fine-tuned on the original emotional speech of the target speaker. Experiments were conducted on publicly available corpora and a Chinese fiction corpus. The results show that the proposed method outperforms CycleGAN-EVC, Seq2Seq-EVC-WA2, SMAL-ET2 and other methods when four evaluation indicators are considered together: Emotional similarity Mean Opinion Score (EMOS), Speaker similarity Mean Opinion Score (SMOS), Mel Cepstral Distortion (MCD), and Word Error Rate (WER).

Key words: limited corpus, speaker-emotion voice conversion, large language model, pre-trained emotional speech synthesis model, fine-tuning
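
The two objective indicators named in the abstract, WER and MCD, have widely used textbook definitions, and a minimal Python sketch of them may help readers interpret the reported comparison. This is an illustration under stated assumptions, not the authors' evaluation code: the function names `word_error_rate` and `mel_cepstral_distortion` are hypothetical, the mel-cepstral frames are assumed to be time-aligned already (for example by dynamic time warping), the 0th (energy) coefficient is assumed to be excluded, and for Chinese text the tokens passed to WER would typically be characters rather than words.

```python
import math
from typing import Sequence


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference tokens."""
    if not reference:
        raise ValueError("reference must contain at least one token")
    # Levenshtein distance over tokens, keeping only the previous row.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i] + [0] * len(hypothesis)
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(reference)


def mel_cepstral_distortion(
    target: Sequence[Sequence[float]],
    converted: Sequence[Sequence[float]],
) -> float:
    """Mean MCD in dB over frames that are ASSUMED to be time-aligned.

    Per frame: (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2),
    where d skips the 0th (energy) coefficient.
    """
    if not target or len(target) != len(converted):
        raise ValueError("inputs must be non-empty and have the same number of frames")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for t_frame, c_frame in zip(target, converted):
        squared = sum((t - c) ** 2 for t, c in zip(t_frame[1:], c_frame[1:]))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(target)
```

Lower values are better for both measures. EMOS and SMOS, by contrast, are subjective listening scores and cannot be computed from the signals alone.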
