Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (12): 3711-3718. DOI: 10.11772/j.issn.1001-9081.2022121897

• Artificial intelligence •

Chinese word segmentation method in electric power domain based on improved BERT

Fei XIA1, Shuaiqi CHEN1, Min HUA2, Bihong JIANG3

  1. College of Automation Engineering, Shanghai University of Electric Power, Shanghai 200090, China
    2. Electric Power Research Institute, State Grid Shanghai Electric Power Company, Shanghai 200437, China
    3. Library, Shanghai University of Electric Power, Shanghai 200090, China
  • Received: 2022-12-26  Revised: 2023-02-26  Accepted: 2023-03-02  Online: 2023-04-27  Published: 2023-12-10
  • Contact: Min HUA
  • About author: XIA Fei, born in 1978, Ph. D., associate professor, senior member of CCF. His research interests include power data analysis and power image processing.
    CHEN Shuaiqi, born in 1997, M. S. candidate. His research interests include natural language processing.
    HUA Min, born in 1987, M. S., engineer. His research interests include science and technology intelligence, data management and application, and digital transformation of energy. E-mail: hmhzgb@163.com.
    JIANG Bihong, born in 1981, M. S., librarian. His research interests include natural language processing and machine learning.
  • Supported by:
    State Grid Science and Technology Project(52094020001A)

Abstract:

To address the poor segmentation performance on the large number of domain-specific terms in Chinese texts from the electric power domain, a Chinese Word Segmentation (CWS) method for the electric power domain based on an improved BERT (Bidirectional Encoder Representations from Transformers) was proposed. Firstly, two lexicons covering general words and domain words respectively were built, and a dual-lexicon matching and integration mechanism was designed to integrate word features directly into the BERT model, enabling the model to use external knowledge more effectively. Secondly, the DEEPNORM method was introduced to improve the model's feature extraction ability, and the optimal model depth was determined by the Bayesian Information Criterion (BIC), allowing the BERT model to be stably deepened to 40 layers. Finally, the classical self-attention layers in the BERT model were replaced with ProbSparse self-attention layers, and the optimal value of the sampling factor was determined by the Particle Swarm Optimization (PSO) algorithm, reducing model complexity while preserving model performance. Word segmentation performance was tested on a manually annotated dataset of patent texts in the electric power domain. Experimental results show that the proposed method achieves an F1 score of 92.87% on this dataset, which is 14.70, 9.89 and 3.60 percentage points higher than those of the compared methods: the Hidden Markov Model (HMM), the multi-standard segmentation model METASEG (pre-training model with META learning for Chinese word SEGmentation), and the Lexicon Enhanced BERT (LEBERT) model. These results verify that the proposed method effectively improves the quality of Chinese word segmentation for texts in the electric power domain.

Key words: Chinese Word Segmentation (CWS), domain word segmentation, improved BERT (Bidirectional Encoder Representations from Transformers), electric power text, deep learning, natural language processing

