Journal of Computer Applications

Sentence-BERT neural topic model with diffusion prior

  

  • Received: 2025-12-19 Revised: 2026-03-27 Online: 2026-04-30 Published: 2026-04-30
  • Supported by:
    Shanxi Scholarship Council of China; Fundamental Research Program of Shanxi Province

WANG Sijia (王思佳)1, WANG Yu (王钰)2

  1. Shanxi University
  2. Computer Center, Shanxi University
  • Corresponding author: WANG Yu

Abstract: Traditional neural topic models adopt a Gaussian prior distribution, which limits the expressive capacity of the topic space and makes it difficult to balance topic coherence and diversity. To address these problems, a Sentence-BERT (SBERT) neural topic model incorporating a diffusion prior was proposed. First, the pre-trained SBERT model was used to semantically encode the input documents: its bidirectional Transformer structure captured the dependencies between words in a document together with global semantic information, and mapped the entire document to a dense, low-dimensional embedding of fixed size. This embedding served as the input of the neural topic model, retaining the contextual semantic coherence of the document. Second, the fixed Gaussian prior assumption of traditional neural topic models was abandoned, and a diffusion model was introduced to model the topic prior distribution flexibly; the step-by-step denoising and random sampling of the diffusion process removed the constraint that a fixed prior imposes on the topic space. Finally, the document embeddings extracted by SBERT were fused with the topic prior modeled by the diffusion model, which significantly improved the diversity of topic representations while maintaining the contextual semantic coherence of the document. Experimental results on three commonly used topic modeling datasets, 20Newsgroups, GoogleNews, and RCV1-v2, show that, relative to the best benchmark model on each metric (the benchmarks include NVDM, ETM, ProdLDA, and LDA), the proposed model reduces perplexity by 11.49% (versus ProdLDA), improves topic coherence by 0.77% (versus ETM), and increases topic diversity by 0.27% (versus ProdLDA).
The proposed model alleviates, to a certain extent, the shortcomings of traditional neural topic models, such as a single prior distribution and insufficient expressive capacity. It significantly enhances topic diversity while maintaining topic semantic coherence, providing a more effective modeling approach and technical solution for high-quality, high-diversity topic mining on large-scale text data.
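
As an illustration of the noising step that a diffusion prior builds on, the following is a minimal sketch and not the authors' implementation. It assumes a standard DDPM-style forward process with a linear noise schedule, q(z_t | z_0) = N(sqrt(alpha_bar_t) z_0, (1 - alpha_bar_t) I); the abstract does not state the schedule, the number of steps, or the latent dimension, so `linear_beta_schedule`, `forward_diffuse`, and all numeric values below are hypothetical. The learned denoiser that reverses this process, and the fusion with SBERT embeddings, are omitted.

```python
import numpy as np


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear noise schedule (values are assumptions, not from the paper)."""
    return np.linspace(beta_start, beta_end, num_steps)


def forward_diffuse(z0, t, alpha_bar, rng):
    """Sample z_t from q(z_t | z_0) = N(sqrt(alpha_bar_t) * z0, (1 - alpha_bar_t) * I)."""
    noise = rng.standard_normal(z0.shape)
    z_t = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return z_t, noise


rng = np.random.default_rng(0)
betas = linear_beta_schedule(1000)
# Cumulative product of (1 - beta) gives the fraction of signal kept at each step.
alpha_bar = np.cumprod(1.0 - betas)

# Stand-in latent topic vectors: 4 documents, 50 dimensions (assumed sizes).
z0 = rng.standard_normal((4, 50))

z_early, _ = forward_diffuse(z0, 0, alpha_bar, rng)           # almost unchanged
z_late, noise_late = forward_diffuse(z0, 999, alpha_bar, rng)  # almost pure noise
```

In this sketch the last step leaves almost no signal from `z0`, so sampling can start from Gaussian noise and be denoised step by step; a trained reverse model would then define a prior over topic vectors that is not restricted to a single Gaussian.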

Key words: Topic Model, Diffusion Model, Prior Distribution, Pre-trained Language Model, Word Embedding Representation

