Journal of Computer Applications: 55-60. DOI: 10.11772/j.issn.1001-9081.2024071061

• Artificial intelligence •

Robust few-shot enterprise risk identification method by integrating knowledge graph and contrastive learning

Xiaobin LYU, Haosen HUANG, Xin ZHOU, Jinlai WANG, Ya HE

  1. Chengdu Information Technology of Chinese Academy of Sciences Company Limited, Chengdu Sichuan 610213, China
  • Received: 2024-07-30 Revised: 2024-10-17 Accepted: 2024-10-21 Online: 2025-01-24 Published: 2024-12-31
  • Contact: Ya HE

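  • About the authors: Xiaobin LYU (吕晓斌), born in 1978, male, from Luzhou, Sichuan; senior engineer. Research interests: machine learning, data mining, intelligent Web, smart cities.
    Haosen HUANG (黄浩森), born in 1993, male, from Chengdu, Sichuan; engineer. Research interests: machine learning, data mining, intelligent Web, smart cities.
    Xin ZHOU (周鑫), born in 1990, male, from Zizhong, Sichuan; engineer. Research interests: machine learning, big data, data mining, intelligent Web.
    Jinlai WANG (王近来), born in 1994, male, from Zhoukou, Henan; engineer. Research interests: big data, machine learning, data mining, intelligent Web.
    Ya HE (何亚), born in 1981, male, from Chengdu, Sichuan; senior engineer, M.S. Research interests: smart cities, big data, data mining, machine learning.
  • Supported by:
    "Light of West China" Young Scholars program (西部之光青年学者)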

Abstract:

Enterprise risk monitoring is a critical safeguard for regional economic stability. However, the complex structure and diverse logical associations of economic data, together with the involvement of sensitive enterprise information, make it difficult to acquire and integrate heterogeneous data for risk analysis. Moreover, some enterprises facing operational risks may falsify data, which severely undermines the accuracy and reliability of risk identification. Therefore, a novel few-shot risk learning method, Knowledge Graph Contrastive Learning based on Large Language Model (KGCLM), was proposed. Firstly, a comprehensive enterprise risk knowledge graph was constructed, covering multi-dimensional semantic information such as risk events and risk factors, so as to characterize enterprise risk features fully. Secondly, a Large Language Model (LLM) was employed to enhance the semantics of risk knowledge through word vectors, which alleviated the sparsity of risk data to some extent and improved the model's semantic understanding. Thirdly, a heterogeneous Graph Neural Network (GNN) model was designed to uniformly model and learn representations of cross-modal risk data, including enterprise registration, investment, judicial, and public opinion data, thereby fusing multi-source heterogeneous data effectively. Finally, a contrastive learning mechanism was introduced: by constructing positive and negative sample pairs, the model's ability to keep the representations of similar samples consistent and to distinguish different samples was enhanced, which significantly increased its robustness against falsified data. Experimental results on the Small Enterprise Risk Dataset (SERD) and the China Enterprise Risk Dataset (CERD) demonstrate that KGCLM significantly outperforms the baseline models in accuracy and various F1 scores. Specifically, KGCLM achieves an accuracy of 90.37% on SERD and 74.51% on CERD, which validates the method's superior performance in handling data scarcity and interference from falsified data.
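
The contrastive step in the abstract (pulling positive pairs together and pushing negative pairs apart) can be illustrated with a minimal sketch. The abstract does not state which loss KGCLM uses, so the InfoNCE form below is an assumption, as are the function name `info_nce_loss`, the temperature value, and the random toy embeddings standing in for enterprise representations. A lightly perturbed copy of each embedding plays the role of a positive sample, and all other rows in the batch act as negatives.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE loss (assumed here; the source does not name the loss).

    Row i of `positives` is the positive sample for row i of `anchors`;
    every other row in the batch serves as a negative.
    """
    # L2-normalise so the dot product is cosine similarity.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature
    # Subtract the row maximum for numerical stability before the softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Average negative log-likelihood of the matching (diagonal) pairs.
    return float(-np.mean(np.diag(log_prob)))

# Hypothetical toy data: 8 enterprise embeddings of dimension 16.
rng = np.random.default_rng(0)
clean = rng.normal(size=(8, 16))
# Small perturbation as a stand-in for mildly falsified records.
perturbed = clean + 0.05 * rng.normal(size=clean.shape)
unrelated = rng.normal(size=(8, 16))

loss_aligned = info_nce_loss(clean, perturbed)
loss_unrelated = info_nce_loss(clean, unrelated)
print(loss_aligned, loss_unrelated)
```

In this toy setting the loss is lower when each embedding is paired with its perturbed copy than when it is paired with an unrelated embedding. Minimising such a loss is what encourages representations that stay consistent under small changes to the input.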

Key words: investment attraction, risk identification, contrastive learning, Graph Neural Network (GNN), multi-source heterogeneous data

CLC Number: