Journal of Computer Applications ›› 2015, Vol. 35 ›› Issue (6): 1649-1653. DOI: 10.11772/j.issn.1001-9081.2015.06.1649


Project keyword lexicon and keyword semantic network based on word co-occurrence matrix

WANG Qing1,2, CHEN Zeya1,2, GUO Jing1,2, CHEN Xi3, WANG Jinghua3   

  1. School of Computer Science and Technology, University of Science and Technology of China, Hefei Anhui 230027, China;
    2. Suzhou Institute for Advanced Study, University of Science and Technology of China, Suzhou Jiangsu 215123, China;
    3. State Grid Information and Telecommunication Branch, Beijing 100761, China
  • Received: 2015-01-13 Revised: 2015-03-26 Published: 2015-06-12

基于词共现矩阵的项目关键词词库和关键词语义网络

王庆1,2, 陈泽亚1,2, 郭静1,2, 陈晰3, 王晶华3   

  1. 中国科学技术大学 计算机科学与技术学院, 合肥 230027;
    2. 中国科学技术大学 苏州研究院, 江苏 苏州 215123;
    3. 国家电网公司信息通信分公司, 北京 100761
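  • Corresponding author: WANG Qing (born 1990), male, from Linyi, Shandong, master's student; main research interests: wireless sensor networks, big data and cloud computing; qingwang@mail.ustc.edu.cn
  • About the authors: CHEN Zeya (born 1990), male, from Shanghai, master's student; main research interest: wireless sensor networks. GUO Jing (born 1989), female, from Hefei, Anhui, master's student; main research interests: wireless sensor networks, big data and cloud computing. CHEN Xi (born 1980), male, from Beijing, senior engineer, Ph.D.; main research interests: nonlinear systems, complex networks. WANG Jinghua (born 1962), female, from Beijing, senior engineer, master's degree; main research interest: data processing in smart grids.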

Abstract:

To address keyword extraction and the construction of a project keyword lexicon for technological projects in professional fields, an algorithm for building the lexicon based on semantic relations and a co-occurrence matrix was proposed. Building on conventional co-occurrence-matrix keyword extraction, the algorithm improved the traditional approach by also taking into account each keyword's position in the document, its part of speech, and its Inverse Document Frequency (IDF). In addition, a method was given for building a keyword semantic network from the co-occurrence matrix and for identifying hot keywords by computing their similarity to a semantic base vector. Finally, simulation experiments were run on 882 project documents from the power field. The experimental results show that the proposed algorithm can effectively extract keywords for technological projects and establish the keyword correlation network, and that it outperforms a multi-feature-fusion keyword extraction algorithm for Chinese text in precision, recall and F1-score.

Key words: keyword extraction, co-occurrence matrix, keyword lexicon, keyword semantic network, power project
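
As a rough illustration of the ideas named in the abstract, the sketch below builds a document-level word co-occurrence count and combines each word's co-occurrence degree with its IDF and a position boost. It is not the authors' algorithm: the scoring formula, the function names (`cooccurrence`, `idf`, `score_keywords`) and the `title_boost` parameter are all assumptions made for this example, and part of speech is not modeled.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence(docs):
    """Count how often each unordered word pair appears in the same document."""
    counts = Counter()
    for doc in docs:
        # Sorting makes every pair key canonical: (a, b) with a < b.
        for a, b in combinations(sorted(set(doc)), 2):
            counts[(a, b)] += 1
    return counts


def idf(docs):
    """Inverse document frequency: log(N / number of documents containing w)."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    return {w: math.log(n / df[w]) for w in df}


def score_keywords(docs, title_words=frozenset(), title_boost=2.0):
    """Assumed scoring rule: co-occurrence degree * IDF * position boost."""
    weights = idf(docs)
    degree = Counter()
    for (a, b), c in cooccurrence(docs).items():
        degree[a] += c
        degree[b] += c
    scores = {}
    for word, d in degree.items():
        # Hypothetical position factor: words that also occur in a title count more.
        boost = title_boost if word in title_words else 1.0
        scores[word] = d * weights[word] * boost
    return scores


# Toy corpus of already-tokenized documents (made-up data).
docs = [["grid", "power", "load"], ["grid", "power"], ["grid", "sensor"]]
scores = score_keywords(docs, title_words={"power"})
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy corpus a word that appears in every document gets an IDF of zero, so it drops out of the ranking regardless of how often it co-occurs with other words; that is the usual motivation for adding IDF to a pure co-occurrence score.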

摘要:

针对专业领域中科技项目的关键词提取和项目词库建立的问题,提出了一种基于语义关系、利用共现矩阵建立项目关键词词库的方法。该方法在传统的基于共现矩阵提取关键词研究的基础上,综合考虑了关键词在文章中的位置、词性以及逆向文件频率(IDF)等因素,对传统算法进行改进。另外,给出一种利用共现矩阵建立关键词关联网络,并通过计算与语义基向量相似度识别热点关键词的方法。使用882篇电力项目数据进行仿真实验,实验结果表明改进后的方法能够有效对科技项目进行关键词提取,建立关键词关联网络,并在准确率、召回率以及平衡F分数(F1-score)等指标上明显优于基于多特征融合的中文文本关键词提取方法。

关键词: 关键词提取, 共现矩阵, 关键词词库, 关键词语义网络, 电力项目

CLC Number: