Chinese homophonic neologism discovery method based on Pinyin similarity

doi:10.11772/j.issn.1001-9081.2022091390

Journal of Computer Applications, 2023, Vol. 43, Issue 9: 2715-2720.

2022 10th CCF Conference on Big Data

Hanchen LI^1,2, Shunxiang ZHANG^1,2, Guangli ZHU^1,2, Tengke WANG^1,2

^1. School of Computer Science and Engineering, Anhui University of Science and Technology, Huainan Anhui 232001, China
^2. Institute of Artificial Intelligence Research, Hefei Comprehensive National Science Center, Hefei Anhui 230088, China

Received: 2022-09-06  Revised: 2022-10-17  Accepted: 2022-10-18  Online: 2022-11-29  Published: 2023-09-10
Contact: Shunxiang ZHANG
About the authors: LI Hanchen, born in 1997, a native of Huaibei, Anhui, M. S. candidate, CCF member. His research interests include natural language processing and Web mining.
ZHU Guangli, born in 1971, a native of Huainan, Anhui, M. S., associate professor. Her research interests include Web mining, semantic search and the theory of computation.
WANG Tengke, born in 1999, a native of Taizhou, Zhejiang, M. S. candidate. His research interests include natural language processing and Web mining.
Supported by: National Natural Science Foundation of China (62076006); Anhui University Collaborative Innovation Project (GXXT-2021-008)

Abstract

New word identification is one of the basic tasks of natural language processing and supports the construction of Chinese dictionaries and the analysis of word sentiment tendency. However, current new word identification methods do not specifically handle homophonic neologisms, which results in low precision on this class of words. To solve this problem, a Chinese homophonic neologism discovery method based on Pinyin similarity was proposed, in which the precision of homophonic neologism identification is improved by comparing the pronunciations of new and old words. Firstly, the text was preprocessed, the Average Mutual Information (AMI) was calculated to determine the degree of internal cohesion of candidate words, and the improved branch entropy was used to determine the boundaries of candidate new words. Then, the retained words were converted into similar-sounding Chinese Pinyin and compared with the Pinyin of the old words in the Chinese dictionary, and the most similar comparison results were retained. Finally, if a comparison result exceeded the threshold, the new word in the result was taken as a homophonic neologism, and its corresponding old word was taken as the original word. Experimental results on self-built Weibo datasets show that, compared with BNshCNs (Blended Numeric and symbolic homophony Chinese Neologisms) and DSSCNN (similarity computing model based on Dependency Syntax and Semantics), the proposed method improves precision by 0.51 and 5.27 percentage points, recall by 2.91 and 6.31 percentage points, and F1 score by 1.75 and 5.81 percentage points, respectively, indicating that the proposed method identifies Chinese homophonic neologisms more effectively.
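
The candidate-filtering step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the AMI form used here (log-ratio of the word probability to the product of its character probabilities, divided by word length) is one common formulation and is an assumption, and the entropy is the standard branch entropy, not the paper's improved variant. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(text, max_len):
    """Count every substring of length 1..max_len in the text."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def average_mutual_information(word, counts, total):
    """Assumed form: log2(p(word) / product of p(char)) divided by word length."""
    p_word = counts[word] / total
    p_chars = math.prod(counts[ch] / total for ch in word)
    return math.log2(p_word / p_chars) / len(word)


def branch_entropy(word, text, side):
    """Shannon entropy of the characters adjacent to `word` on one side."""
    neighbours = Counter()
    start = text.find(word)
    while start != -1:
        idx = start - 1 if side == "left" else start + len(word)
        if 0 <= idx < len(text):
            neighbours[text[idx]] += 1
        start = text.find(word, start + 1)
    total = sum(neighbours.values())
    probs = [c / total for c in neighbours.values()]
    return sum(-p * math.log2(p) for p in probs)


# Toy corpus (hypothetical); a real run would use preprocessed Weibo text.
corpus = "神马都是浮云，神马情况，这是神马东西"
counts = ngram_counts(corpus, max_len=2)
cohesion = average_mutual_information("神马", counts, len(corpus))
boundary = min(branch_entropy("神马", corpus, "left"),
               branch_entropy("神马", corpus, "right"))
print(cohesion, boundary)
```

A candidate would be kept only when both scores clear their respective cut-offs; a high AMI suggests the characters belong together, and a high entropy on both sides suggests the string has free boundaries.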

Key words: homophonic neologism, new word identification, Pinyin similarity, Average Mutual Information (AMI), branch entropy
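
The Pinyin comparison and threshold step can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the character-to-Pinyin table is a hypothetical toy lookup, the per-syllable similarity uses Python's `difflib.SequenceMatcher` in place of the paper's similarity measure, and the threshold of 0.7 is arbitrary.

```python
from difflib import SequenceMatcher

# Hypothetical toy lookup; a real system would use a full Pinyin lexicon.
PINYIN = {
    "杯": "bei", "具": "ju", "悲": "bei", "剧": "ju",
    "神": "shen", "马": "ma", "什": "shen", "么": "me",
    "喜": "xi", "欢": "huan",
}


def to_pinyin(word):
    """Convert a word to a list of toneless Pinyin syllables."""
    return [PINYIN[ch] for ch in word]


def pinyin_similarity(a, b):
    """Mean per-syllable string similarity; 0.0 if syllable counts differ."""
    pa, pb = to_pinyin(a), to_pinyin(b)
    if len(pa) != len(pb):
        return 0.0
    scores = [SequenceMatcher(None, x, y).ratio() for x, y in zip(pa, pb)]
    return sum(scores) / len(scores)


def find_original(candidate, dictionary, threshold=0.7):
    """Return (old_word, score) for the most similar old word, or None."""
    scored = [(w, pinyin_similarity(candidate, w)) for w in dictionary]
    if not scored:
        return None
    best = max(scored, key=lambda pair: pair[1])
    # Keep the pair only when the similarity exceeds the threshold.
    return best if best[1] > threshold else None


old_words = ["悲剧", "什么", "喜欢"]
print(find_original("杯具", old_words))
print(find_original("神马", old_words))
```

Only the single best match per candidate is kept before the threshold test, mirroring the "most similar result" wording of the abstract.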

CLC Number: TP391

Hanchen LI, Shunxiang ZHANG, Guangli ZHU, Tengke WANG. Chinese homophonic neologism discovery method based on Pinyin similarity[J]. Journal of Computer Applications, 2023, 43(9): 2715-2720.

Word	Phonetic transcription	Word	Phonetic transcription
bear	[ber]	need	[niːd]
book	[bʊk]	peach	[piːtʃ]
duck	[dʌk]	rose	[roʊz]
five	[faɪv]	word	[wɜːrd]

Dataset	Method	Precision (%)	Recall (%)	F1 (%)
Dataset 1	BNShCNs	85.92	82.36	84.10
	DSSCNN	81.16	78.96	80.04
	Proposed method	86.43	85.27	85.85
Dataset 2	BNShCNs	80.22	76.98	78.57
	DSSCNN	74.36	71.47	72.89
	Proposed method	78.60	77.24	77.91
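
The gains quoted in the abstract correspond to the first dataset's rows in the table above. The short check below recomputes each F1 value as the harmonic mean of precision and recall and derives the percentage-point gains from the reported figures. The dictionary layout and function names are illustrative only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the first dataset's rows.
dataset1 = {
    "BNShCNs": (85.92, 82.36, 84.10),
    "DSSCNN": (81.16, 78.96, 80.04),
    "proposed": (86.43, 85.27, 85.85),
}


def gains_over(baseline):
    """Percentage-point gains of the proposed method over a baseline."""
    pairs = zip(dataset1["proposed"], dataset1[baseline])
    return tuple(round(p - b, 2) for p, b in pairs)


for name, (p, r, reported_f1) in dataset1.items():
    # Recomputed F1 next to the reported value.
    print(name, round(f1_score(p, r), 2), reported_f1)

print(gains_over("BNShCNs"))
print(gains_over("DSSCNN"))
```

The gains are taken from the reported (rounded) F1 values; differencing the unrounded recomputed F1 values can shift the last digit.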

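Method	Precision (%)	Recall (%)	F1 (%)
Improved internal and external statistics	74.35	72.18	73.25
Internal and external statistics + Pinyin similarity comparison	76.31	75.04	75.67
Proposed method	86.43	85.27	85.85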
