To address the facts that speaker conversion and emotional voice conversion have rarely been studied jointly, and that the emotional corpus of a target speaker in real scenarios is usually too small to train a well-generalizing model from scratch, a Speaker-Emotion Voice Conversion with Limited corpus (LSEVC) method was proposed that fuses a large language model with a pre-trained emotional speech synthesis model. Firstly, a large language model was used to generate text carrying the required emotion tags. Secondly, a pre-trained emotional speech synthesis model was fine-tuned on the target speaker's corpus to embed the target speaker's characteristics. Thirdly, emotional speech was synthesized from the generated text for data augmentation. Fourthly, the synthesized speech and the original target speaker's speech were used to jointly train the speaker-emotion voice conversion model. Finally, to further improve the speaker similarity and emotional similarity of the converted speech, the model was fine-tuned on the original target speaker's emotional speech. Experiments were conducted on publicly available corpora and a Chinese fiction corpus. The results show that the proposed method outperforms CycleGAN-EVC, Seq2Seq-EVC-WA2, SMAL-ET2 and other methods on evaluation metrics including Emotional similarity Mean Opinion Score (EMOS), Speaker similarity Mean Opinion Score (SMOS), Mel Cepstral Distortion (MCD), and Word Error Rate (WER).
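The five-step procedure in the abstract can be sketched as a single pipeline. Every function below is a hypothetical placeholder standing in for the corresponding stage (LLM text generation, TTS fine-tuning, synthesis, co-training, fine-tuning); the abstract does not specify any of these implementations, so only the data flow between stages is illustrated:

```python
# Hypothetical sketch of the LSEVC pipeline described in the abstract.
# All function and field names are illustrative placeholders, not the
# authors' actual implementation.

def generate_emotional_text(llm, emotion_tag, n_sentences):
    # Step 1: prompt a large language model for text with the emotion tag.
    return [f"[{emotion_tag}] sentence {i}" for i in range(n_sentences)]

def finetune_tts(pretrained_tts, target_corpus):
    # Step 2: adapt the pre-trained emotional TTS model to the target speaker.
    return {"base": pretrained_tts, "speaker": target_corpus["speaker"]}

def synthesize(tts, texts):
    # Step 3: synthesize emotional speech from generated text (augmentation).
    return [{"speaker": tts["speaker"], "text": t} for t in texts]

def train_conversion_model(synthetic_speech, real_speech):
    # Step 4: co-train the conversion model on synthetic plus real speech.
    return {"train_size": len(synthetic_speech) + len(real_speech),
            "finetuned": False}

def finetune_conversion_model(model, real_emotional_speech):
    # Step 5: fine-tune on the target speaker's real emotional speech to
    # boost speaker and emotional similarity of converted speech.
    return dict(model, finetuned=True)

def lsevc_pipeline(llm, pretrained_tts, target_corpus, emotion_tag, n_aug):
    texts = generate_emotional_text(llm, emotion_tag, n_aug)
    tts = finetune_tts(pretrained_tts, target_corpus)
    synthetic = synthesize(tts, texts)
    model = train_conversion_model(synthetic, target_corpus["utterances"])
    return finetune_conversion_model(model, target_corpus["utterances"])
```

For example, with a 5-utterance target corpus and 10 augmented sentences, the co-training set holds 15 items before the final fine-tuning pass.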