Hot new word discovery applied for detection of network hot news


Journal of Computer Applications, 2020, Vol. 40, Issue 12: 3513-3519. DOI: 10.11772/j.issn.1001-9081.2020040549

WANG Yu, XU Jianmin

School of Cyber Security and Computer, Hebei University, Baoding, Hebei 071000, China

Received: 2020-04-28; Revised: 2020-06-27; Online: 2020-12-10; Published: 2020-07-20
Supported by:
This work is partially supported by the National Social Science Foundation of China (17FTQ002) and the Social Science Foundation of Hebei Province (HB15SH064).

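Corresponding author: WANG Yu (1971-), female, born in Baoding, Hebei; professor, Ph.D.; main research interests: text mining and information retrieval. wy@mail.hbu.edu.cn
About the author: XU Jianmin (1966-), male, born in Baoding, Hebei; professor, Ph.D.; main research interests: personalized information retrieval, Web community discovery, topic detection and tracking, and social network modeling.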

Abstract

Abstract: Based on an analysis of the characteristics of hot words in network news, a hot new word discovery method was proposed for detecting network hot news. Firstly, the Frequent Pattern tree (FP-tree) algorithm was improved to extract frequent word strings as hot new word candidates. Infrequent 1-word strings were deleted from the news data, and the news data were split at infrequent 1-word and 2-word strings, which removed a large amount of useless information and greatly reduced the complexity of the FP-tree. Secondly, the binary Pointwise Mutual Information (PMI) was extended to a multivariate PMI, and the Time PMI (TPMI) was formed by introducing the time features of hot words. TPMI was used to judge the internal cohesion and timeliness of the hot new word candidates, so that unqualified candidates could be removed. Finally, branch entropy was used to determine the boundaries of the candidate new words and thus select the hot new words. A dataset of 7 222 news headlines collected from Baidu network news was used in the experiments. When events reported at least 8 times within half a month were taken as hot news and the adjustment coefficient of the time feature was set to 2, TPMI correctly recognized 51 hot words and missed 4: 2 words that had been hot for a long time and 2 less-hot words that occurred too rarely. The multivariate PMI without time features correctly recognized all 55 hot words but also incorrectly recognized 97 non-hot words. The analysis shows that reducing the complexity of the FP-tree lowers the time and space cost, and the experimental results show that introducing the time feature into hot new word judgement improves the recognition rate of hot new words.
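
The data-reduction step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `split_headlines`, the threshold `min_count`, and the choice to count every occurrence (rather than once per headline) are assumptions made for illustration, and headlines are assumed to be already segmented into word lists.

```python
from collections import Counter


def split_headlines(headlines, min_count):
    """Split tokenized headlines at infrequent words and infrequent adjacent pairs.

    `min_count` is a hypothetical frequency threshold; the abstract does not state one.
    """
    # Count every word and every adjacent word pair across all headlines.
    word_counts = Counter(w for h in headlines for w in h)
    pair_counts = Counter(p for h in headlines for p in zip(h, h[1:]))

    segments = []
    for h in headlines:
        current = []
        for w in h:
            if word_counts[w] < min_count:
                # An infrequent word cannot belong to a frequent string:
                # close the current segment and drop the word itself.
                if current:
                    segments.append(current)
                current = []
                continue
            if current and pair_counts[(current[-1], w)] < min_count:
                # Both words are frequent but the pair is not: cut between them.
                segments.append(current)
                current = []
            current.append(w)
        if current:
            segments.append(current)
    return segments
```

Only the shorter segments returned here would then be fed to the frequent-string miner, which is the sense in which the abstract says the tree becomes less complex. The FP-tree construction itself is not shown.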

Key words: hot new word, Frequent Pattern tree (FP-tree), Pointwise Mutual Information (PMI), branch entropy, time feature
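
Two of the measures named in the key words, PMI and branch entropy, can be sketched in Python as follows. The abstract gives no formulas, so these are assumptions: `multi_pmi` uses one common way of extending PMI from two words to several (the joint probability divided by the product of the individual probabilities), and `branch_entropy` is the ordinary Shannon entropy of the words adjacent to a candidate. The time weighting that turns PMI into TPMI is not reproduced, and both function names are hypothetical.

```python
import math
from collections import Counter


def multi_pmi(p_joint, p_parts):
    """log2(p(w1..wn) / (p(w1) * ... * p(wn))); equals binary PMI when n == 2.

    Assumed generalization, not necessarily the paper's definition.
    """
    return math.log2(p_joint / math.prod(p_parts))


def branch_entropy(neighbors):
    """Shannon entropy, in bits, of the words observed on one side of a candidate."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    # A candidate with many different neighbors has high entropy,
    # which suggests that its boundary on that side is a real word boundary.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A candidate would typically be kept only when its cohesion score and both its left and right branch entropies exceed chosen thresholds. Those thresholds are not given in the abstract.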

CLC Number: TP391

WANG Yu, XU Jianmin. Hot new word discovery applied for detection of network hot news[J]. Journal of Computer Applications, 2020, 40(12): 3513-3519.
