Journal of Computer Applications ›› 2020, Vol. 40 ›› Issue (12): 3513-3519. DOI: 10.11772/j.issn.1001-9081.2020040549

• Artificial intelligence •

Hot new word discovery applied for detection of network hot news

WANG Yu, XU Jianmin   

  1. School of Cyber Security and Computer, Hebei University, Baoding Hebei 071000 China
  • Received:2020-04-28 Revised:2020-06-27 Online:2020-12-10 Published:2020-07-20
  • Supported by:
    This work is partially supported by the National Social Science Foundation of China (17FTQ002) and the Social Science Foundation of Hebei Province (HB15SH064).

用于网络新闻热点识别的热点新词发现

王煜, 徐建民   

  1. 河北大学 网络空间安全与计算机学院, 河北 保定 071000
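  • Corresponding author: WANG Yu (1971-), female, born in Baoding, Hebei, professor, Ph.D.; main research interests: text mining, information retrieval. wy@mail.hbu.edu.cn
  • About the author: XU Jianmin (1966-), male, born in Baoding, Hebei, professor, Ph.D.; main research interests: personalized information retrieval, Web community discovery, topic detection and tracking, social network modeling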
  • 基金资助:
    国家社会科学基金资助项目(17FTQ002);河北省社会科学基金资助项目(HB15SH064)。

Abstract: By analyzing the characteristics of hot words in network news, a hot new word discovery method was proposed for detecting network hot news. Firstly, the Frequent Pattern tree (FP-tree) algorithm was improved to extract frequent word strings as hot new word candidates. A large amount of useless information was removed by deleting the infrequent 1-word strings from the news data and splitting the news data at the infrequent 1-word and 2-word strings, which greatly decreased the complexity of the FP-tree. Secondly, the binary Pointwise Mutual Information (PMI) was extended to a multivariate PMI, and the Time PMI (TPMI) was formed by introducing the time features of hot words. TPMI was used to judge the internal cohesion and timeliness of the hot new word candidates, so that unqualified candidates could be removed. Finally, branch entropy was used to determine the boundaries of the candidate new words and thereby select the hot new words. A dataset of 7 222 news headlines collected from Baidu network news was used for the experiments. When events reported at least 8 times within half a month were taken as hot news and the adjustment coefficient of the time feature was set to 2, TPMI correctly recognized 51 hot words and missed 4: 2 that had been hot for a long time and 2 less-hot words that occurred too infrequently. The multivariate PMI without time features correctly recognized all 55 hot words, but incorrectly recognized 97 non-hot words. The analysis shows that the time and space costs are reduced by decreasing the complexity of the FP-tree, and the experimental results show that introducing the time feature into hot new word judgement improves the recognition rate of hot new words.
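The two scoring steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the multivariate PMI or TPMI formulas, so the code assumes one common n-ary extension of PMI (the joint probability of the word string divided by the product of the individual word probabilities) and omits the time weighting entirely. The function names `multivariate_pmi` and `branch_entropy` and their parameters are hypothetical.

```python
import math
from collections import Counter


def multivariate_pmi(words, unigram_counts, string_count, total):
    """Assumed n-ary PMI: log2( p(w1..wn) / (p(w1) * ... * p(wn)) ).

    words          -- the words making up the candidate string
    unigram_counts -- mapping from word to its occurrence count
    string_count   -- occurrence count of the whole candidate string
    total          -- total count used to turn counts into probabilities
    The paper's exact formula may differ; this is an illustrative assumption.
    """
    joint = string_count / total
    independent = math.prod(unigram_counts[w] / total for w in words)
    return math.log2(joint / independent)


def branch_entropy(neighbors):
    """Shannon entropy (in bits) of the words adjacent to a candidate on one side.

    A high value means the candidate is followed (or preceded) by many
    different words, which suggests a genuine word boundary on that side.
    """
    counts = Counter(neighbors)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

Under these assumptions, a candidate would be kept only if its cohesion score is high and the entropy of both its left and right neighbors is high; any thresholds would have to be chosen empirically.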

Key words: hot new word, Frequent Pattern tree (FP-tree), Pointwise Mutual Information (PMI), branch entropy, time feature

摘要: 通过分析网络新闻热点词的特点,提出了一种用于网络新闻热点识别的热点新词发现方法。首先,用改进FP-tree算法提取频繁出现的词串作为热点新词候选,删除新闻数据中非频繁1-词串,并利用1、2-非频繁词串切割新闻数据,从而删除新闻数据中的大量无用信息,大幅降低FP-tree复杂度;其次,根据二元逐点互信息(PMI)扩展成多元PMI,并引入热点词的时间特征形成时间逐点互信息(TPMI),用TPMI判定热点新词候选的内部结合度和时间性,剔除不合格的候选词;最后,采用邻接熵确定候选新词边界,从而筛选出热点新词。采集百度网络新闻的7 222条新闻标题作为数据集进行实验验证。在将半月内报道次数不低于8次的事件作为热点新闻且时间特征的调节系数为2时,采用TPMI可以正确识别51个热点词,丢失识别2个长时间热点词和2个低热度词,而采用不加入时间特征的多元PMI可正确识别全部热点词55个,但错误识别97个非热点词。分析可知所提的算法降低了FP-tree复杂度,从而减少了时间空间代价,实验结果表明判定热点新词时加入时间特征提高了热点新词识别率。

关键词: 热点新词, FP-tree, 逐点互信息(PMI), 邻接熵, 时间特征
