Journal of Computer Applications, 2014, Vol. 34, Issue (10): 2865-2868. DOI: 10.11772/j.issn.1001-9081.2014.10.2865

Extracting method of emergency news headline and text from webpages

LUO Yonglian, ZHAO Changyuan

  1. School of Information Technology and Engineering, Jinzhong University, Jinzhong Shanxi 030619, China
  • Received: 2014-05-08 Revised: 2014-06-18 Online: 2014-10-30 Published: 2014-10-01
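  • Corresponding author: LUO Yonglian
  • About the authors: LUO Yonglian (born 1973), female, from Jinzhong, Shanxi, associate professor, M.S.; research interests: Chinese information processing, educational measurement theory. ZHAO Changyuan (born 1963), male, from Jinzhong, Shanxi, senior lecturer; research interests: databases, software engineering management.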
To process corpora of emergency news webpages, a method for extracting and locating news content was proposed, based on the characteristics of emergency news and of webpage tags. Taking webpage tags and text similarity as machine learning features, the method extracts news headlines using the Bayes method. Meanwhile, by exploiting the stability of emergency news vocabulary and the nesting of webpage tags, the method reduces both the amount of text to be processed and the dimensionality of the text vectors, and then computes vector similarity to locate the beginning and end of the news text. The experimental results show that the method extracts news headlines with an accuracy of 86.5% and news texts with an average accuracy of more than 78%. The proposed method is effective and efficient, and offers a reference for mining webpage tags and the intrinsic information of text on webpages.
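
As an illustration of the similarity-based locating step described in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes the headline has already been extracted and the page has been split into candidate text blocks. The function names (`term_vector`, `cosine`, `locate_body`), the regex tokenizer, and the threshold of 0.1 are all hypothetical choices made for this example; Chinese text would need a proper word segmenter in place of the regex.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts. The regex tokenizer is an assumption;
    it stands in for real word segmentation."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def locate_body(headline, blocks, threshold=0.1):
    """Return (start, end) indices of the first and last blocks whose
    similarity to the headline reaches the threshold, or None.
    The threshold value is a hypothetical choice for illustration."""
    ref = term_vector(headline)
    hits = [i for i, block in enumerate(blocks)
            if cosine(ref, term_vector(block)) >= threshold]
    return (hits[0], hits[-1]) if hits else None


blocks = [
    "Home | Sports | Login",
    "An earthquake struck the city on Monday",
    "Rescue teams reached the earthquake zone",
    "Copyright 2014 all rights reserved",
]
span = locate_body("Earthquake strikes city", blocks)
```

Here the navigation and copyright blocks share no terms with the headline, so only the two middle blocks are selected as the news text. The sketch omits the tag-nesting and vocabulary-stability reductions that the abstract mentions.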



