Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (9): 2707-2714.DOI: 10.11772/j.issn.1001-9081.2022091407
2022 10th CCF Conference on Big Data
Yuelin TIAN1,2, Ruizhang HUANG1,2, Lina REN1,2
Received: 2022-09-06
Revised: 2022-10-27
Accepted: 2022-11-07
Online: 2023-09-10
Published: 2023-09-10
Contact: Ruizhang HUANG
About author: TIAN Yuelin, born in 1997 in Shenzhou, Hebei, M. S. candidate, CCF member. Her research interests include natural language processing, text mining, and machine learning.
Yuelin TIAN, Ruizhang HUANG, Lina REN. Scholar fine-grained information extraction method fused with local semantic features[J]. Journal of Computer Applications, 2023, 43(9): 2707-2714.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2022091407
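Masking mode | Example |
---|---|
Original sentence | 使用语言模型来预测下一个词的概率 (Use a language model to predict the probability of the next word) |
Typical masking | 使用语言模型来预[M]下一个词的[M]率 |
WWM (whole word masking) | 使用语言模型来[M][M]下一个词的[M][M] |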
Tab. 1 Examples of masking modes
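The difference between the two masking modes in Tab. 1 can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the word segmentation is written by hand for this one sentence, and the masked positions are fixed to match the table, whereas real pre-training would use a word segmenter and choose positions at random.

```python
MASK = "[M]"

def char_mask(sentence, positions):
    """Typical masking: replace individual characters at the given indices."""
    return "".join(MASK if i in positions else ch for i, ch in enumerate(sentence))

def whole_word_mask(words, word_indices):
    """Whole word masking: replace every character of each selected word."""
    return "".join(MASK * len(w) if i in word_indices else w
                   for i, w in enumerate(words))

sentence = "使用语言模型来预测下一个词的概率"
# Hand-written segmentation (assumption, for illustration only).
words = ["使用", "语言", "模型", "来", "预测", "下一个", "词", "的", "概率"]

# Character-level masking hits one character inside each of two words.
typical = char_mask(sentence, {8, 14})
# Whole word masking covers both characters of the same two words.
wwm = whole_word_mask(words, {4, 8})

print(typical)
print(wwm)
```

With character-level masking the model can still see half of each word, while whole word masking hides the complete word and forces the prediction to rely on the surrounding context.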
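Label | Description | Example |
---|---|---|
base_info | Basic information | Name: xxx; Title: Specially Appointed Professor, doctoral supervisor |
edu | Education | 2006, B.S., School of Telecommunication Engineering, Beijing University of Posts and Telecommunications, Beijing, China, Communication Engineering |
work | Work experience | Lecturer, School of Computer Science and Technology, Tianjin University, 2011.7~2014.6 |
research | Research interests | Computer vision; computer graphics |
achievement | Achievements | Has published more than 120 SCI/EI papers in mainstream international journals and conferences, including 18 IEEE/ACM Trans. papers, 22 CCF class-A papers, and 17 class-B papers. |
publications | Academic publications | Jing Zhang, Zhanpeng Fang, Wei Chen, and Jie Tang. Diffusion of "Following" Links in Microblogging Networks. IEEE Transaction on Knowledge and Data Engineering (TKDE), 2015, Volume 27, Issue 8, Pages 2093-2106. |
projects | Research projects | Tianjin Research Program of Application Foundation and Advanced Technology (Natural Science Foundation), Youth Project |
awards | Awards | Second place, innovation category, 3rd 由田 Machine Vision Award |
teaching | Courses taught | Digital Signal Processing (undergraduate); Computational Photography (graduate) |
social_service | Professional service | Reviewer for international journals including IEEE Trans. SMCB, IEEE Trans. Multimedia, Pattern Recognition Letters, and The Visual Computer |
other | Other content | Copyright: Xi'an Jiaotong University |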
Tab. 2 Detailed explanation of fine-grained information on scholar homepage
Hyperparameter | Value | Hyperparameter | Value |
---|---|---|---|
TextLength | 100 | Hidden layer nodes of CNN | 256 |
LearningRate | 1E-5 | BatchSize | 1 |
Dropout | 0.15 | Epochs | 7~18 |
CNN-KernelSize | 7 | | |
Tab. 3 Hyperparameter setting
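For readers who want to reuse the settings in Tab. 3, one way to carry them in code is a small frozen dataclass. The class and field names below are hypothetical, and the padding helper assumes a stride-1 convolution with an odd kernel, which the table does not state.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    """Values copied from Tab. 3; the field names are illustrative only."""
    text_length: int = 100        # TextLength (unit not stated in the table)
    learning_rate: float = 1e-5   # LearningRate
    dropout: float = 0.15         # Dropout
    cnn_kernel_size: int = 7      # CNN-KernelSize
    cnn_hidden_nodes: int = 256   # Hidden layer nodes of CNN
    batch_size: int = 1           # BatchSize
    min_epochs: int = 7           # lower end of Epochs (7~18)
    max_epochs: int = 18          # upper end of Epochs (7~18)

    def same_padding(self) -> int:
        """Padding that keeps the sequence length unchanged.

        Assumes stride 1 and an odd kernel size (not stated in the source).
        """
        return (self.cnn_kernel_size - 1) // 2

config = TrainConfig()
print(config.same_padding())
```

Freezing the dataclass keeps the experiment settings immutable once constructed, which makes accidental changes between runs less likely.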
Model | P/% | R/% | F1/% |
---|---|---|---|
BERT | 83.87 | 82.46 | 82.89 |
ELECTRA | 85.26 | 85.59 | 85.33 |
RoBERTa | 87.35 | 84.82 | 85.93 |
RoBERTa-TextCNN | 86.50 | 83.47 | 84.83 |
Ours+ELECTRA-CNN | 93.16 | 92.83 | 92.96 |
Ours+RoBERTa-CNN | 93.07 | 93.84 | 93.43 |
Tab. 4 Comparison of scholar information extraction results
Class | P/% | R/% | F1/% | support |
---|---|---|---|---|
Macro average | 93.07 | 93.84 | 93.43 | 44 193 |
base_info | 95.06 | 95.12 | 95.09 | 3 454 |
edu | 96.30 | 97.15 | 96.72 | 1 038 |
work | 92.89 | 95.23 | 94.02 | 1 318 |
research | 88.97 | 91.94 | 90.41 | 900 |
achievement | 83.14 | 83.43 | 83.24 | 529 |
publications | 98.39 | 98.18 | 98.28 | 6 019 |
projects | 95.60 | 94.50 | 95.01 | 1 327 |
awards | 93.88 | 95.18 | 94.01 | 890 |
teaching | 88.34 | 89.70 | 88.89 | 316 |
social_service | 93.59 | 93.42 | 93.48 | 750 |
other | 98.68 | 98.42 | 98.55 | 27 652 |
Tab. 5 Fine-grained information extraction results of scholar homepage
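The macro-average row of Tab. 5 can be checked against the per-class rows with a short script. This is a sketch that assumes the macro average is the unweighted mean of the eleven per-class values; the source does not say how the row was computed (for example, whether it was averaged over several runs).

```python
# (class, P/%, R/%, F1/%, support) copied from Tab. 5
rows = [
    ("base_info",      95.06, 95.12, 95.09, 3454),
    ("edu",            96.30, 97.15, 96.72, 1038),
    ("work",           92.89, 95.23, 94.02, 1318),
    ("research",       88.97, 91.94, 90.41, 900),
    ("achievement",    83.14, 83.43, 83.24, 529),
    ("publications",   98.39, 98.18, 98.28, 6019),
    ("projects",       95.60, 94.50, 95.01, 1327),
    ("awards",         93.88, 95.18, 94.01, 890),
    ("teaching",       88.34, 89.70, 88.89, 316),
    ("social_service", 93.59, 93.42, 93.48, 750),
    ("other",          98.68, 98.42, 98.55, 27652),
]

def mean(values):
    """Unweighted mean, i.e. every class counts equally."""
    values = list(values)
    return sum(values) / len(values)

macro_p = mean(r[1] for r in rows)
macro_r = mean(r[2] for r in rows)
macro_f1 = mean(r[3] for r in rows)
total_support = sum(r[4] for r in rows)

print(f"P={macro_p:.2f} R={macro_r:.2f} F1={macro_f1:.2f} support={total_support}")
```

Under this assumption the recall, F1, and support totals agree with the printed macro-average row, while the recomputed precision comes out roughly 0.1 points above the printed 93.07. That gap may reflect rounding or a different averaging procedure in the original experiments.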
kernel_size | F1/% | kernel_size | F1/% |
---|---|---|---|
1 | 84.23 | 11 | 92.99 |
3 | 91.01 | 13 | 93.03 |
5 | 92.40 | 15 | 92.54 |
7 | 92.71 | 17 | 92.48 |
9 | 92.34 | 19 | 92.67 |
Tab. 6 Relationship between receptive field size and model effect
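To make the link between kernel size and receptive field concrete, the sketch below lists which token positions a single convolution output can see, and then reads off the best-scoring kernel size from Tab. 6. The helper name is hypothetical, and a stride-1 convolution with an odd kernel is assumed.

```python
def receptive_window(center, kernel_size, length):
    """Token indices covered by a stride-1 convolution centered at `center`.

    Assumes an odd kernel size; the window is clipped at the sequence edges.
    """
    half = (kernel_size - 1) // 2
    return range(max(0, center - half), min(length, center + half + 1))

# Scores by kernel_size, copied from Tab. 6
score_by_kernel = {
    1: 84.23, 3: 91.01, 5: 92.40, 7: 92.71, 9: 92.34,
    11: 92.99, 13: 93.03, 15: 92.54, 17: 92.48, 19: 92.67,
}

# With kernel_size 7, position 10 sees three neighbours on each side.
window = list(receptive_window(center=10, kernel_size=7, length=100))
best_kernel = max(score_by_kernel, key=score_by_kernel.get)
gain_1_to_3 = score_by_kernel[3] - score_by_kernel[1]

print(window, best_kernel, round(gain_1_to_3, 2))
```

A kernel of size 1 sees only the current token, which lines up with its noticeably lower score in the table; from size 5 upward the scores stay within about one point of each other.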
Model | P/% | R/% | F1/% |
---|---|---|---|
No-pooling | 93.68 | 91.79 | 92.71 |
+maxpooling | 93.22 | 92.01 | 92.58 |
+avgpooling | 93.49 | 91.44 | 92.39 |
Tab. 7 Effect comparison of pooling layer
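The pooling variants compared in Tab. 7 can be illustrated on a toy feature map. This is a generic sketch of max and average pooling over the sequence axis, not the authors' implementation; the feature values are made up and the function names are hypothetical.

```python
def max_pool(features):
    """Per-channel maximum over all sequence positions."""
    return [max(channel) for channel in zip(*features)]

def avg_pool(features):
    """Per-channel mean over all sequence positions."""
    return [sum(channel) / len(channel) for channel in zip(*features)]

# Toy convolution output: 3 sequence positions x 2 channels (made-up numbers).
features = [
    [1, 9],
    [5, 2],
    [3, 4],
]

pooled_max = max_pool(features)
pooled_avg = avg_pool(features)

print(pooled_max)  # [5, 9]
print(pooled_avg)  # [3.0, 5.0]
```

Both pooling operations collapse the position axis, so the information about where a feature occurred is discarded; skipping pooling keeps the per-position features, which may be one reason the No-pooling row has the highest F1 in the table.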
Model | Dataset | P/% | R/% | F1/% |
---|---|---|---|---|
RoBERTa-TextCNN | waimai_10k | 90.32 | 89.70 | 90.00 |
RoBERTa-TextCNN | NLPCC2014 | 48.56 | 61.06 | 53.03 |
RoBERTa-TextCNN | toutiaonews38w | 82.85 | 82.81 | 82.83 |
Ours+RoBERTa-CNN | waimai_10k | 90.40 | 89.82 | 90.09 |
Ours+RoBERTa-CNN | NLPCC2014 | 52.13 | 59.48 | 55.17 |
Ours+RoBERTa-CNN | toutiaonews38w | 82.84 | 82.82 | 82.83 |
Tab. 8 Experimental results of universality of different models on different datasets