Journal of Computer Applications, 2023, Vol. 43, Issue (9): 2715-2720. DOI: 10.11772/j.issn.1001-9081.2022091390

• 2022 10th CCF Conference on Big Data •

Chinese homophonic neologism discovery method based on Pinyin similarity

Hanchen LI1,2, Shunxiang ZHANG1,2, Guangli ZHU1,2, Tengke WANG1,2

  1. School of Computer Science and Engineering, Anhui University of Science and Technology, Huainan Anhui 232001, China
  2. Institute of Artificial Intelligence Research, Hefei Comprehensive National Science Center, Hefei Anhui 230088, China
  • Received:2022-09-06 Revised:2022-10-17 Accepted:2022-10-18 Online:2022-11-29 Published:2023-09-10
  • Contact: Shunxiang ZHANG
  • About author: LI Hanchen, born in 1997, M. S. candidate. His research interests include natural language processing and Web mining.
    ZHU Guangli, born in 1971, M. S., associate professor. Her research interests include Web mining, semantic search and computation theory.
    WANG Tengke, born in 1999, M. S. candidate. His research interests include natural language processing and Web mining.
  • Supported by:
    National Natural Science Foundation of China (62076006); Anhui University Collaborative Innovation Project (GXXT-2021-008)

基于拼音相似度的中文谐音新词发现方法

李瀚臣1,2, 张顺香1,2, 朱广丽1,2, 王腾科1,2

  1. 安徽理工大学 计算机科学与工程学院,安徽 淮南 232001
  2. 合肥综合性国家科学中心 人工智能研究院,合肥 230088
  • 通讯作者: 张顺香
  • 作者简介:李瀚臣(1997—),男,安徽淮北人,硕士研究生,CCF会员,主要研究方向:自然语言处理、Web挖掘
    朱广丽(1971—),女,安徽淮南人,副教授,硕士,主要研究方向:Web挖掘、语义搜索、计算理论
    王腾科(1999—),男,浙江台州人,硕士研究生,主要研究方向:自然语言处理、Web挖掘。
  • 基金资助:
    国家自然科学基金资助项目(62076006);安徽高校协同创新项目(GXXT-2021-008)

Abstract:

As one of the basic tasks of natural language processing, new word identification supports the construction of Chinese dictionaries and the analysis of word sentiment tendency. However, current new word identification methods do not specifically target homophonic neologisms, resulting in low precision when identifying them. To solve this problem, a Chinese homophonic neologism discovery method based on Pinyin similarity was proposed, which improves the precision of homophonic neologism identification by comparing the Pinyin of new words with that of old words. Firstly, the text was preprocessed, the Average Mutual Information (AMI) was calculated to determine the degree of internal cohesion of candidate words, and the improved branch entropy was used to determine the boundaries of candidate new words. Then, the retained words were converted into Pinyin forms with similar pronunciations, their similarity to the Pinyin of the old words in the Chinese dictionary was computed, and the most similar comparison results were retained. Finally, if a comparison result exceeded the threshold, the new word in the result was taken as a homophonic neologism, and the corresponding old word was taken as its original word. Experimental results on self-built Weibo datasets show that, compared with BNshCNs (Blended Numeric and symbolic homophony Chinese Neologisms), the proposed method improves precision, recall and F1 score by 0.51, 2.91 and 1.75 percentage points respectively, and compared with DSSCNN (similarity computing model based on Dependency Syntax and Semantics), by 5.27, 6.31 and 5.81 percentage points respectively, indicating that the proposed method identifies Chinese homophonic neologisms more effectively.
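The candidate-filtering step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact AMI formula or the improvement made to branch entropy, so the code below assumes a common length-normalised mutual information and the plain (unimproved) left/right branch entropy. All function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(corpus, max_len):
    """Count every character n-gram of length 1..max_len in the corpus."""
    counts = Counter()
    for sentence in corpus:
        for n in range(1, max_len + 1):
            for i in range(len(sentence) - n + 1):
                counts[sentence[i:i + n]] += 1
    return counts


def average_mutual_information(word, counts, total_chars):
    """Assumed form: log2(p(word) / product of p(char)), divided by word length."""
    p_word = counts[word] / total_chars
    if p_word == 0:
        return float("-inf")
    p_chars = math.prod(counts[ch] / total_chars for ch in word)
    return math.log2(p_word / p_chars) / len(word)


def branch_entropy(word, corpus, side):
    """Entropy of the characters adjacent to `word` on the given side."""
    neighbours = Counter()
    for sentence in corpus:
        start = sentence.find(word)
        while start != -1:
            idx = start - 1 if side == "left" else start + len(word)
            if 0 <= idx < len(sentence):
                neighbours[sentence[idx]] += 1
            start = sentence.find(word, start + 1)
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in neighbours.values())


# Hypothetical toy corpus; a real run would use preprocessed Weibo text.
corpus = ["我觉得杯具了", "真是杯具啊", "杯具的一天"]
counts = ngram_counts(corpus, max_len=2)
total_chars = sum(len(s) for s in corpus)

cohesion = average_mutual_information("杯具", counts, total_chars)
left_h = branch_entropy("杯具", corpus, "left")
right_h = branch_entropy("杯具", corpus, "right")
```

In this sketch a candidate is kept when its cohesion is high and both its left and right branch entropies are high, meaning the string holds together internally and appears in varied contexts. The cut-off values are left open because the abstract does not state them.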

Key words: homophonic neologism, new word identification, Pinyin similarity, Average Mutual Information (AMI), branch entropy
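The Pinyin similarity comparison named in the key words can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation. The per-syllable edit-distance score, the threshold of 0.8, the tiny Pinyin lookup table and every function name are hypothetical stand-ins for details the abstract does not specify.

```python
# Hypothetical toy lookup; a real system would use a full Pinyin converter.
TOY_PINYIN = {
    "杯": "bei", "具": "ju", "悲": "bei", "剧": "ju",
    "神": "shen", "马": "ma", "什": "shen", "么": "me",
    "天": "tian", "气": "qi",
}


def to_pinyin(word):
    """Map each character to its (toneless) Pinyin syllable."""
    return [TOY_PINYIN[ch] for ch in word]


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def syllable_similarity(s, t):
    """1 minus the edit distance normalised by the longer syllable."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(s, t) / longest


def pinyin_similarity(new_word, old_word):
    """Average per-syllable similarity; words of different length score 0."""
    new_py, old_py = to_pinyin(new_word), to_pinyin(old_word)
    if len(new_py) != len(old_py):
        return 0.0
    scores = [syllable_similarity(s, t) for s, t in zip(new_py, old_py)]
    return sum(scores) / len(scores)


def find_original(candidate, dictionary, threshold=0.8):
    """Keep the most similar old word, and accept it only above the threshold."""
    best_word, best_score = None, 0.0
    for old_word in dictionary:
        if old_word == candidate:
            continue
        score = pinyin_similarity(candidate, old_word)
        if score > best_score:
            best_word, best_score = old_word, score
    return (best_word, best_score) if best_score > threshold else None


dictionary = ["悲剧", "什么", "天气"]
match = find_original("杯具", dictionary)
near_miss = find_original("神马", dictionary)
```

Here `match` pairs the candidate 杯具 with the dictionary word 悲剧 because their Pinyin is identical, while 神马 stays below the assumed threshold against 什么 and is rejected. Lowering the threshold would accept it, which shows how the threshold trades recall against precision in this sketch.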

摘要:

新词识别作为自然语言处理的基础任务之一,为构建中文词典、分析词语情感倾向等提供了支持。然而,目前的新词识别方法没有考虑针对谐音新词的识别,导致谐音新词识别的准确率不高。为了解决这一问题,提出一种基于拼音相似度的中文谐音新词发现方法,引入新旧词拼音比较来提高谐音新词识别的准确率。首先,对文本进行预处理,计算平均互信息(AMI)以判定候选词的内部结合度,并使用改进邻接熵确定候选新词的边界;然后,将保留下的词转换成发音相近的汉语拼音与中文词典中的旧词拼音进行相似度比较,并保留最相似的比较结果;最后,若比较结果超过阈值,则将结果中的新词作为谐音新词,对应的旧词即为谐音新词的原有词。在自建的微博数据集上的实验结果表明,与BNshCNs(Blended Numeric and symbolic homophony Chinese Neologisms)、依存句法与语义信息结合的相似性计算模型(DSSCNN)相比,所提方法的准确率、召回率和F1分数分别提高了0.51和5.27个百分点、2.91和6.31个百分点以及1.75和5.81个百分点。可见所提方法具有更好的中文谐音新词识别效果。

关键词: 谐音新词, 新词识别, 拼音相似度, 平均互信息, 邻接熵

CLC Number: