[1] COWIE J, LEHNERT W. Information extraction[J]. Communications of the ACM, 1996, 39(1): 80-91.
[2] MOONEY R J, BUNESCU R. Mining knowledge from text using information extraction[J]. ACM SIGKDD Explorations Newsletter, 2005, 7(1): 3-10.
[3] CHANG C-H, LUI S-C. IEPAD: information extraction based on pattern discovery[C]// WWW'01: Proceedings of the 10th International Conference on World Wide Web. New York: ACM, 2001: 681-688.
[4] BANKO M, CAFARELLA M J, SODERLAND S, et al. Open information extraction from the Web[C]// IJCAI 2007: Proceedings of the 20th International Joint Conference on Artificial Intelligence. Menlo Park, CA: AAAI Press, 2007: 2670-2676.
[5] BAUMGARTNER R, FLESCA S, GOTTLOB G. Visual Web information extraction with Lixto[C]// VLDB'01: Proceedings of the 27th International Conference on Very Large Data Bases. San Francisco, CA: Morgan Kaufmann, 2001: 119-128.
[6] SUN C J, GUAN Y. A statistical approach for content extraction from Web page[J]. Journal of Chinese Information Processing, 2004, 18(5): 17-22. (in Chinese)
[7] ZHAO X X, SUO H G, LIU Y S. Web content information extraction method based on tag window[J]. Application Research of Computers, 2007, 24(3): 144-145. (in Chinese)
[8] WANG L, JIANG J Z, GUO J L. Information extraction from Web page based on extended DOM tree[J]. Computer Applications and Software, 2007, 24(6): 137-139. (in Chinese)
[9] GOTTLOB G, KOCH C. Logic-based Web information extraction[J]. ACM SIGMOD Record, 2004, 33(2): 87-94.
[10] MEI X, CHENG X Q, GUO Y, et al. Fully automatic wrapper generation for Web information extraction[J]. Journal of Chinese Information Processing, 2008, 22(1): 22-29. (in Chinese)
[11] SONG M Q, ZHANG R X, WU X T, et al. A new approach to content extraction from Web page[J]. Journal of Dalian University of Technology, 2009, 49(4): 594-597. (in Chinese)
[12] ZHOU J Y, ZHU Z M, GAO X F. Research on content extraction from Chinese Web page based on statistic and content-features[J]. Journal of Chinese Information Processing, 2009, 23(5): 80-85. (in Chinese)
[13] ZHANG R X, SONG M Q, GONG Y L. Parsing DOM tree reversely and extracting Web main page information[J]. Computer Science, 2011, 38(4): 213-215. (in Chinese)
[14] DEBNATH S, MITRA P, PAL N, et al. Automatic identification of informative sections of Web pages[J]. IEEE Transactions on Knowledge and Data Engineering, 2005, 17(9): 1233-1246.
[15] YANG L Q, LI X D, GENG G G. Study of Web pages content extraction based on layout similarity[J]. Application Research of Computers, 2015, 32(9): 2581-2586. (in Chinese)
[16] DING B Q, XIE Y P, WU Q. Noise elimination method in Web page based on improved DOM tree[J]. Journal of Computer Applications, 2009, 29(6): 175-177. (in Chinese)
[17] SHAO J. Web pages content extraction based on visual hot zone[J]. Computer Applications and Software, 2012, 29(6): 199-201. (in Chinese)
[18] LI X, JIANG S Y. Content extraction from Web page based on the DOM tree and line-text statistical noise-elimination[J]. Journal of Shandong University (Natural Science), 2012, 47(3): 38-42. (in Chinese)
[19] WU Q, CHEN X S, TAN J. Content extraction algorithm of HTML pages based on optimized weight[J]. Journal of South China University of Technology (Natural Science Edition), 2011, 39(4): 32-37. (in Chinese)
[20] Jsoup[EB/OL]. [2015-12-03]. http://jsoup.org/.
[21] China Internet Network Information Center. The 35th development statistics report of China Internet network[R/OL]. [2015-02-03]. http://www.cnnic.cn/hlwfzyj/hlwxzbg/201502/P020150203551802054676.pdf. (in Chinese)
[22] Readability. Read comfortably-anytime, anywhere[EB/OL]. [2015-12-05]. https://www.readability.com/.
[23] Newspaper: Article scraping & curation[EB/OL]. [2015-11-06]. http://newspaper.readthedocs.org/en/latest/.