Journal of Computer Applications, 2014, Vol. 34, Issue (10): 2865-2868. DOI: 10.11772/j.issn.1001-9081.2014.10.2865

• Artificial Intelligence •

Extracting method of emergency news headline and text from webpages

LUO Yonglian, ZHAO Changyuan

  1. School of Information Technology and Engineering, Jinzhong University, Jinzhong, Shanxi 030619, China
  • Received: 2014-05-08 Revised: 2014-06-18 Online: 2014-10-01 Published: 2014-10-30
  • Contact: LUO Yonglian
  • About the authors: LUO Yonglian (born 1973), female, from Jinzhong, Shanxi, associate professor, M.S.; research interests: Chinese information processing, educational measurement theory. ZHAO Changyuan (born 1963), male, from Jinzhong, Shanxi, senior lecturer; research interests: databases, software engineering management.
  • Supported by:

    Teaching Reform Project of Higher Education Institutions of Shanxi Province; Shanxi Province Education Science "Eleventh Five-Year Plan" Project

Abstract:

To address the processing of emergency news webpage corpora, a method for extracting and locating news content based on the characteristics of emergency news and on webpage tag information was proposed. The method takes webpage tags and text similarity as machine learning features and extracts news headlines with a Bayesian classification method. Exploiting the stable vocabulary of event news and the nesting of webpage tags, it reduces the amount of text to be processed and the dimensionality of the text vectors, and then computes vector similarity to locate the beginning and end of the news body. Experimental results show that the method extracts headlines with an accuracy of 86.5% and body text with an average accuracy above 78%. The method extracts news content effectively and is easy to implement, and it offers a useful reference for mining tag information and the text's own information in other webpage text processing tasks.
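The vector-similarity step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the similarity measure, the tokenization, or the reference text, so cosine similarity over term-frequency vectors, whitespace tokenization, and the use of the headline as the reference are all assumptions here, and the function names and sample data are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a term-frequency vector (whitespace tokenization is an assumption)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_block(reference, blocks):
    """Return the index of the text block closest to the reference text."""
    ref_vec = term_vector(reference)
    scores = [cosine_similarity(ref_vec, term_vector(b)) for b in blocks]
    return max(range(len(blocks)), key=scores.__getitem__)


# Hypothetical example data, not taken from the paper.
headline = "earthquake strikes coastal city"
blocks = [
    "home news sports weather",
    "a strong earthquake strikes the coastal city early on monday",
    "copyright all rights reserved",
]
start_index = most_similar_block(headline, blocks)
```

In this sketch the navigation and footer blocks share no terms with the headline, so the second block is chosen as the likely start of the body. Chinese text would need a word segmenter in place of whitespace splitting.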

CLC Number: