Text keyword extraction method based on word frequency statistics

Journal of Computer Applications, 2016, Vol. 36, Issue (3): 718-725. DOI: 10.11772/j.issn.1001-9081.2016.03.718

LUO Yan^1,2,3, ZHAO Shuliang^1,2,3, LI Xiaochao^1,2,3, HAN Yuhui^1,2,3, DING Yafei^1,2,3

1. College of Mathematics and Information Science, Hebei Normal University, Shijiazhuang, Hebei 050024, China;
2. Hebei Key Laboratory of Computational Mathematics and Applications, Hebei Normal University, Shijiazhuang, Hebei 050024, China;
3. Institute of Mobile Internet of Things, Hebei Normal University, Shijiazhuang, Hebei 050024, China

Received: 2015-07-24 Revised: 2015-09-16 Online: 2016-03-17 Published: 2016-03-10
Corresponding author: ZHAO Shuliang
About the authors: LUO Yan (born 1993), female, from Lingshou, Hebei, M.S. candidate; research interests: data mining and intelligent information processing. ZHAO Shuliang (born 1967), male, from Xianxian, Hebei, professor, doctoral supervisor, Ph.D.; research interests: data mining and intelligent information processing. LI Xiaochao (born 1987), male, from Yongnian, Hebei, M.S.; research interests: data mining and intelligent information processing. HAN Yuhui (born 1989), male, from Xingtai, Hebei, M.S. candidate; research interests: data mining and intelligent information processing. DING Yafei (born 1988), female, from Shijiazhuang, Hebei, M.S. candidate; research interests: data mining and intelligent information processing.
Supported by:
This work is partially supported by the National Natural Science Foundation of China (71271067), the National Social Science Foundation of China (13BTY011), the Major Project of the National Social Science Foundation of China (13&ZD091), the Research Program of Science and Technology at Universities of Hebei Province (QN2014196), and the Master Foundation of Hebei Normal University (201402002).

Abstract

Abstract: To address the low efficiency and limited accuracy of the traditional TF-IDF (Term Frequency-Inverse Document Frequency) algorithm in keyword extraction, a text keyword extraction method based on word frequency statistics was proposed. First, a formula for the number of same-frequency words in a text was derived from Zipf's law. Second, this formula was used to determine the proportion of words at each frequency in a text, which showed that the vast majority of words in a text are low-frequency words. Finally, these word frequency statistics were applied to keyword extraction, yielding a TF-IDF algorithm based on word frequency statistics. Simulation experiments were conducted on Chinese and English text data sets. The average relative error of the derived formula for the number of same-frequency words did not exceed 0.05, and the maximum absolute error of the established proportions of words at each frequency was 0.04. Compared with the traditional TF-IDF algorithm, the proposed algorithm increased the average precision, average recall and average F1-measure, and decreased the average runtime. The experimental results show that, in text keyword extraction, the TF-IDF algorithm based on word frequency statistics outperforms the traditional TF-IDF algorithm in precision, recall and F1-measure, and effectively reduces the runtime of keyword extraction.

Keywords: word frequency statistics, Zipf's law, same-frequency word, keyword extraction, Term Frequency-Inverse Document Frequency (TF-IDF) algorithm
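
The abstract does not state the derived formula or the modified algorithm, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes the idealized form of Zipf's law, f · r = C, under which the number of words occurring exactly n times is C/n − C/(n+1) = C/(n(n+1)) and the share of distinct words at frequency n is roughly 1/(n(n+1)), so about half of all distinct words occur only once. It also assumes, as one plausible reading of the abstract, that the frequency statistics are used to skip low-frequency words before TF-IDF scoring. The names `tokenize`, `expected_share`, `observed_shares`, `tfidf_keywords` and the parameter `min_count` are hypothetical, and the IDF variant log(N/df) is a choice made here for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of ASCII letters (a stand-in for real word segmentation)."""
    return re.findall(r"[a-z]+", text.lower())


def expected_share(n):
    """Share of distinct words occurring exactly n times under the idealized model f * r = C."""
    return 1.0 / (n * (n + 1))


def observed_shares(tokens):
    """Map each frequency n to the observed share of distinct words with that frequency."""
    word_counts = Counter(tokens)
    freq_of_freq = Counter(word_counts.values())
    distinct = len(word_counts)
    return {n: count / distinct for n, count in sorted(freq_of_freq.items())}


def tfidf_keywords(docs, doc_index, top_k=5, min_count=1):
    """Rank the words of one document by TF-IDF.

    min_count=1 gives plain TF-IDF with idf = log(N / df).
    min_count>1 is the assumed low-frequency filter: words seen fewer than
    min_count times in the target document are skipped before scoring.
    """
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(w for toks in tokenized for w in set(toks))
    counts = Counter(tokenized[doc_index])
    total = sum(counts.values())
    scores = {}
    for word, count in counts.items():
        if count < min_count:
            continue
        tf = count / total
        idf = math.log(len(docs) / doc_freq[word])
        scores[word] = tf * idf
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


docs = [
    "apple apple banana cherry",
    "banana date",
    "cherry date egg",
]
print(observed_shares(tokenize(docs[0])))
print(tfidf_keywords(docs, 0, min_count=1))
print(tfidf_keywords(docs, 0, min_count=2))
```

Comparing `observed_shares` against `expected_share` on a real corpus is one way to check how well the idealized proportions hold. Chinese text would need a proper word segmenter in place of `tokenize`.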

CLC Number: TP391

LUO Yan, ZHAO Shuliang, LI Xiaochao, HAN Yuhui, DING Yafei. Text keyword extraction method based on word frequency statistics[J]. Journal of Computer Applications, 2016, 36(3): 718-725.

