Journal of Computer Applications, 2016, Vol. 36, Issue (3): 718-725. DOI: 10.11772/j.issn.1001-9081.2016.03.718

• Artificial intelligence •

Text keyword extraction method based on word frequency statistics

LUO Yan1,2,3, ZHAO Shuliang1,2,3, LI Xiaochao1,2,3, HAN Yuhui1,2,3, DING Yafei1,2,3

  1. College of Mathematics and Information Science, Hebei Normal University, Shijiazhuang, Hebei 050024, China;
  2. Hebei Key Laboratory of Computational Mathematics and Applications, Hebei Normal University, Shijiazhuang, Hebei 050024, China;
  3. Institute of Mobile Internet of Things, Hebei Normal University, Shijiazhuang, Hebei 050024, China
  • Received: 2015-07-24 Revised: 2015-09-16 Online: 2016-03-10 Published: 2016-03-17
  • Corresponding author: ZHAO Shuliang
  • About the authors: LUO Yan (born 1993), female, from Lingshou, Hebei, M.S. candidate; research interests: data mining, intelligent information processing. ZHAO Shuliang (born 1967), male, from Xianxian, Hebei, professor, doctoral supervisor, Ph.D.; research interests: data mining, intelligent information processing. LI Xiaochao (born 1987), male, from Yongnian, Hebei, M.S.; research interests: data mining, intelligent information processing. HAN Yuhui (born 1989), male, from Xingtai, Hebei, M.S. candidate; research interests: data mining, intelligent information processing. DING Yafei (born 1988), female, from Shijiazhuang, Hebei, M.S. candidate; research interests: data mining, intelligent information processing.
  • Supported by:
    This work was partially supported by the National Natural Science Foundation of China (71271067), the National Social Science Foundation of China (13BTY011), the Key Project of the National Social Science Foundation of China (13&ZD091), the Research Program of Science and Technology at Universities of Hebei Province (QN2014196), and the Master Foundation of Hebei Normal University (201402002).

Abstract: To address the low efficiency and poor accuracy of the traditional TF-IDF (Term Frequency-Inverse Document Frequency) algorithm in keyword extraction, a text keyword extraction method based on word frequency statistics was proposed. Firstly, a formula for the number of words that share the same frequency in a text (same-frequency words) was derived from Zipf's law. Secondly, the share of words at each frequency was determined from this formula, which showed that the vast majority of words in a text are low-frequency words. Finally, this word frequency regularity was applied to keyword extraction, yielding a TF-IDF algorithm based on word frequency statistics. Simulation experiments were conducted on Chinese and English text data sets. The average relative error of the derived same-frequency word count formula did not exceed 0.05, and the maximum absolute error of the share of words at each frequency was 0.04. Compared with the traditional TF-IDF algorithm, the TF-IDF algorithm based on word frequency statistics improved the average precision, average recall and average F1-measure, and reduced the average runtime. The results show that, in text keyword extraction, the TF-IDF algorithm based on word frequency statistics outperforms the traditional TF-IDF algorithm in precision, recall and F1-measure, and effectively reduces the runtime of keyword extraction.

Key words: word frequency statistics, Zipf's law, same-frequency word, keyword extraction, Term Frequency-Inverse Document Frequency (TF-IDF) algorithm
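
The three steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract does not give the exact formula or pruning rule, so everything below rests on stated assumptions. The function `zipf_share` uses the classical result for the idealised Zipf form f(r) = C/r: words occurring n times lie between ranks C/(n+1) and C/n, so roughly C/(n(n+1)) of about C distinct words share frequency n, a share of 1/(n(n+1)). The paper's own formula may differ. The function `tfidf_keywords` and its `min_count` parameter are hypothetical names, and dropping low-count words before scoring is only one plausible way to apply the frequency regularity to TF-IDF.

```python
import math
from collections import Counter
from fractions import Fraction


def zipf_share(n):
    """Expected share of distinct words occurring exactly n times.

    Assumption: idealised Zipf form f(r) = C / r, which gives the
    classical share 1 / (n * (n + 1)); the paper's formula may differ.
    """
    return Fraction(1, n * (n + 1))


def observed_share(tokens, n):
    """Observed share of distinct words that occur exactly n times."""
    counts = Counter(tokens)
    same_freq = sum(1 for c in counts.values() if c == n)
    return same_freq / len(counts)


def tfidf_keywords(docs, doc_index, top_k=3, min_count=1):
    """Rank the words of docs[doc_index] by TF-IDF.

    docs is a list of token lists. Words whose count in the target
    document is below min_count are skipped before scoring; this
    pruning rule is a hypothetical stand-in for the paper's method.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each word once per document

    target = docs[doc_index]
    counts = Counter(target)
    scores = {}
    for word, c in counts.items():
        if c < min_count:
            continue  # pruned low-frequency word, never scored
        tf = c / len(target)
        idf = math.log(n_docs / doc_freq[word])  # plain, unsmoothed IDF
        scores[word] = tf * idf
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


if __name__ == "__main__":
    docs = [
        ["zipf", "zipf", "law", "text", "word"],
        ["text", "word", "count"],
        ["text", "mining", "word"],
    ]
    print(sum(zipf_share(n) for n in range(1, 4)))  # share of words seen at most 3 times
    print(tfidf_keywords(docs, 0, top_k=2))
    print(tfidf_keywords(docs, 0, top_k=2, min_count=2))
```

Under this idealised model, words occurring at most k times make up k/(k+1) of the vocabulary, so pruning them leaves far fewer candidates to score. That is consistent with the runtime reduction reported in the abstract, although the sketch makes no claim about precision or recall.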

CLC number: