Sentence-BERT neural topic model with diffusion prior

doi:10.11772/j.issn.1001-9081.2025121521

Abstract

To address the limited expressive capacity of the topic space and the difficulty of balancing topic coherence and diversity, both caused by the Gaussian prior adopted in traditional neural topic models, a Sentence-BERT (SBERT) neural topic model with a diffusion prior was proposed. First, the pre-trained SBERT model was used to semantically encode the input documents. Its bidirectional Transformer structure captured the dependencies between words and the global semantic information, mapping each document to a fixed-size, low-dimensional dense embedding that served as the input of the neural topic model and retained the document's contextual semantic coherence. Second, the fixed Gaussian prior assumption of traditional neural topic models was abandoned, and a diffusion model was introduced to model the topic prior distribution flexibly. The step-by-step denoising and random sampling of the diffusion process removed the constraint that a fixed prior imposes on the topic space. Finally, the document embeddings extracted by SBERT were fused with the topic prior modeled by the diffusion model, which significantly improved the diversity of topic representations while maintaining the document's contextual semantic coherence. Experimental results on three commonly used topic modeling datasets (20Newsgroups, GoogleNews, and RCV1-v2) show that, compared with baseline models such as NVDM, ETM, ProdLDA, and LDA, the proposed model reduces perplexity by 11.49% relative to the best baseline on that metric (ProdLDA), improves topic coherence by 0.77% relative to the best baseline (ETM), and increases topic diversity by 0.27% relative to the best baseline (ProdLDA). The proposed model alleviates, to some extent, the shortcomings of traditional neural topic models, such as a single prior distribution and insufficient expressive capacity. It enhances topic diversity while maintaining topic semantic coherence, providing a more effective modeling approach and technical solution for high-quality, high-diversity topic mining on large-scale text data.
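
The abstract does not give the diffusion formulation, so the following is only a minimal sketch of the general idea of noising and denoising a latent topic vector. It assumes a standard DDPM-style linear noise schedule and a softmax mapping to topic proportions. All function names, the schedule parameters, and the 4-topic latent are hypothetical, and the true noise stands in for the learned noise predictor that a real model would condition on the SBERT embedding.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return cumulative products abar_t = prod_{s<=t} (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        # Linearly interpolate beta from beta_start to beta_end (assumed schedule).
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def forward_noise(z0, t, alpha_bars, rng):
    """Sample z_t ~ N(sqrt(abar_t) * z0, (1 - abar_t) * I); return (z_t, eps)."""
    scale = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    z_t = [scale * z + sigma * e for z, e in zip(z0, eps)]
    return z_t, eps


def estimate_z0(z_t, eps_hat, t, alpha_bars):
    """Invert the forward step given an estimate eps_hat of the added noise."""
    scale = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    return [(z - sigma * e) / scale for z, e in zip(z_t, eps_hat)]


def softmax(z):
    """Map a latent vector to topic proportions that sum to 1."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
alpha_bars = make_alpha_bars(num_steps=100)
z0 = [0.5, -1.0, 2.0, 0.0]  # hypothetical 4-topic latent vector
z_t, eps = forward_noise(z0, t=50, alpha_bars=alpha_bars, rng=rng)
# The true noise stands in for a trained noise predictor (assumption).
z0_hat = estimate_z0(z_t, eps, t=50, alpha_bars=alpha_bars)
theta = softmax(z0_hat)  # document-topic proportions
```

In a trained model the noise estimate would be imperfect, so the recovered latent would vary from sample to sample. That variability is one plausible reading of how the "random sampling characteristics" mentioned in the abstract could yield a more flexible prior than a single fixed Gaussian.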

Key words: Topic Model, Diffusion Model, Prior Distribution, Pre-trained Language Model, Word Embedding Representation

CLC Number: TP391.1; TP18

WANG Sijia (王思佳), WANG Yu (王钰). Sentence-BERT neural topic model with diffusion prior [J]. Journal of Computer Applications, DOI: 10.11772/j.issn.1001-9081.2025121521.
