Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (10): 3138-3145. DOI: 10.11772/j.issn.1001-9081.2024091371

• Artificial Intelligence •

Analysis and prediction model of Weibo public opinion heat by integrating BERT and X-means algorithm

Zhangtao JIANG, Xin LI, Shihao ZHANG, Xinyang ZHAO

  1. School of Information and Cyber Security, People’s Public Security University of China, Beijing 100038, China
  • Received: 2024-09-27 Revised: 2025-02-14 Accepted: 2025-02-17 Online: 2025-03-14 Published: 2025-10-10
  • Contact: Shihao ZHANG
  • About author: JIANG Zhangtao, born in 2000 in Jinan, Shandong, M. S. candidate, CCF member. His research interests include cyber security and technical intelligence.
    LI Xin, born in 1977 in Ganzhou, Jiangxi, Ph. D., professor, CCF member. His research interests include artificial intelligence and cyber security.
    ZHANG Shihao, born in 1992 in Linfen, Shanxi, Ph. D. candidate, lecturer. His research interests include data mining and social network analysis. Email: Zhangshihao@ppsuc.edu.cn
    ZHAO Xinyang, born in 2002 in Linyi, Shandong, M. S. candidate. His research interests include information hiding.
  • Supported by:
    Fundamental Research Funds for the Central Universities (2020JKF316)

Abstract:

In public opinion discovery and prediction on social media platforms such as Weibo, “fake hotspots” created by internet trolls reduce analysis accuracy. To reflect the true heat of Weibo public opinion, a Weibo public opinion heat analysis and prediction model integrating BERT (Bidirectional Encoder Representations from Transformers) and the X-means algorithm, called BXpre, was proposed. The model combines attribute features of the participating users with time-domain features of the heat changes to improve the accuracy of heat prediction. Firstly, the original Weibo posts and the data of interacting users were preprocessed, and a fine-tuned StructBERT model was used to classify these data, thereby determining the relevance between each interacting user and the original post. This relevance served as a reference value for calculating the weight of each user’s contribution to the heat growth of the post. Secondly, the interacting users were clustered by their features with the X-means algorithm, and trolls were filtered out based on the homogeneity of the resulting clusters. A weight penalty mechanism targeting troll samples was then introduced and combined with label relevance to construct a Weibo heat index model. Finally, the cosine similarity between the second derivative of the prior heat value with respect to time and the real data was calculated to predict future changes in Weibo heat. Experimental results show that the Weibo public opinion heat rankings output by BXpre are closer to the real data at different user scales. Under mixed-scale test conditions, the prediction correlation index of BXpre reaches 90.88%, which is 12.71, 14.80, and 11.30 percentage points higher than those of three traditional methods based on the Long Short-Term Memory (LSTM) network, the eXtreme Gradient Boosting (XGBoost) algorithm, and Temporal Difference Ranking (TDR), respectively, and 9.76 and 11.95 percentage points higher than those of ChatGPT and Wenxin Yiyan, respectively.

Key words: Weibo public opinion heat analysis and prediction, BERT model, X-means algorithm, troll detection, social network analysis
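
The abstract does not give the formulas behind BXpre, so the following Python sketch is only an illustration of two ideas it names: a heat index in which users flagged as trolls receive a penalized weight, and a cosine similarity computed on second derivatives of heat curves. Everything specific here is an assumption rather than the authors' method: the function names, the multiplicative `penalty` value, the use of second differences as a discrete stand-in for the second derivative, and the choice to take second differences of both curves. The StructBERT relevance scores and the X-means troll flags are taken as given inputs and are not reproduced.

```python
import math


def heat_index(relevance, is_troll, penalty=0.1):
    """Sum per-user relevance scores, down-weighting flagged trolls.

    `penalty` is a hypothetical multiplicative weight for troll accounts;
    the abstract does not state the actual penalty scheme.
    """
    total = 0.0
    for score, troll in zip(relevance, is_troll):
        weight = penalty if troll else 1.0
        total += weight * score
    return total


def second_difference(series):
    """Discrete approximation of the second derivative of a heat curve."""
    return [
        series[i + 2] - 2 * series[i + 1] + series[i]
        for i in range(len(series) - 2)
    ]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm else 0.0


def curvature_similarity(prior, observed):
    """Compare how two heat curves bend over time (assumed interpretation)."""
    return cosine_similarity(second_difference(prior), second_difference(observed))
```

Under these assumptions, two curves that accelerate in the same way score close to 1 regardless of their absolute scale, which is one plausible reason to compare second derivatives rather than raw heat values.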

CLC Number: