Focused topic Web crawler based on improved TF-IDF algorithm

doi:10.11772/j.issn.1001-9081.2015.10.2901

Abstract

Abstract: To address the large amount of irrelevant data in Web search results and the low semantic retrieval accuracy of the traditional TF-IDF algorithm, the K-means algorithm and the adaptive genetic algorithm, an improvement of the TF-IDF algorithm and its application to semantic retrieval were studied. The TF-IDF algorithm was improved by combining regular expressions with semantic analysis. The search topic was described by a semantic database. The similarity of each regular atom in a document was obtained by a weighted calculation based on the semantic importance of the regular atom and its position among the Web page tags. The final search results were obtained by a cosine operation between the document similarity and the topic model in the vector space model. Finally, the improved TF-IDF algorithm, the traditional TF-IDF algorithm, the K-means algorithm and the adaptive genetic algorithm were each applied to a focused topic Web crawler, and their retrieval results were compared. The results show that, in the vertical search of the focused topic Web crawler, the accuracy of the improved TF-IDF algorithm was 17.1 percentage points higher than that of the traditional TF-IDF algorithm, and its omission rate was 7.76 percentage points lower. Compared with the K-means algorithm and the adaptive genetic algorithm, the accuracy of the improved TF-IDF algorithm was 6 percentage points and 8.1 percentage points higher, respectively. In summary, the improved TF-IDF algorithm effectively raises the accuracy of document similarity detection and mitigates the weakness of focused topic Web crawlers in semantic analysis.

Key words: Web crawler, semantic analysis, search engine, Term Frequency-Inverse Document Frequency (TF-IDF), topic crawler, document similarity
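
The abstract describes three steps: weighting regular atoms by semantic importance and by tag position, forming a TF-IDF style vector, and taking the cosine against the topic. The sketch below illustrates that flow under stated assumptions and is not the authors' implementation. The tag weights, the atom patterns and their importance values, the smoothed IDF formula, and the names `TAG_WEIGHTS`, `TOPIC_ATOMS`, `weighted_tf`, `idf`, `page_vector` and `cosine` are all hypothetical, since the abstract gives none of them.

```python
import math
import re

# Hypothetical weights: the abstract does not give concrete values.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "body": 1.0}

# Hypothetical topic description: regex "atom" -> semantic importance.
TOPIC_ATOMS = {
    r"crawl(er|ing)?s?": 1.0,
    r"web": 0.5,
    r"topic(al|s)?": 0.8,
}


def weighted_tf(page, pattern):
    """Position-weighted match count of one atom in a page (tag -> text)."""
    rx = re.compile(r"\b(?:%s)\b" % pattern, re.IGNORECASE)
    total = 0.0
    for tag, text in page.items():
        matches = sum(1 for _ in rx.finditer(text))
        total += TAG_WEIGHTS.get(tag, 1.0) * matches
    return total


def idf(pages, pattern):
    """Smoothed inverse document frequency of one atom (an assumed formula)."""
    n = len(pages)
    df = sum(1 for page in pages if weighted_tf(page, pattern) > 0)
    return math.log((1 + n) / (1 + df)) + 1.0


def page_vector(page, pages):
    """One component per atom: importance * weighted TF * IDF."""
    return [
        importance * weighted_tf(page, pattern) * idf(pages, pattern)
        for pattern, importance in TOPIC_ATOMS.items()
    ]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy pages, represented as tag -> text.
pages = [
    {"title": "Focused web crawler", "body": "A crawler fetches pages on one topic."},
    {"title": "Cooking pasta", "body": "Boil water and add salt."},
]

# The topic vector is simply the importance of each atom.
topic = list(TOPIC_ATOMS.values())
scores = [cosine(page_vector(page, pages), topic) for page in pages]
```

A crawler built along these lines would presumably follow links only from pages whose score exceeds some threshold; that threshold is also a tuning choice the abstract does not specify.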


CLC Number: TP311

WANG Jingzhong, QIU Tongxiang. Focused topic Web crawler based on improved TF-IDF algorithm[J]. Journal of Computer Applications, 2015, 35(10): 2901-2904.

王景中, 邱铜相. 基于TF-IDF改进算法的聚焦主题网络爬虫[J]. 计算机应用, 2015, 35(10): 2901-2904.

