Attribute-based entity alignment algorithm for decentralized data storage in large-scale institutions

doi:10.11772/j.issn.1001-9081.2024091388

Journal of Computer Applications, 2025, Vol. 45, Issue 10: 3195-3202.

• Data Science and Technology •

Zeyi CAO¹,², Yan CHANG¹,²,³, Renxin LAI¹,², Shibin ZHANG¹,²,³, Zhi QIN¹,²,³, Lili YAN¹,²,³, Xuejian ZHANG¹,², Yuanhao DI¹,²

1. School of Cybersecurity (Xin Gu Industrial College), Chengdu University of Information Technology, Chengdu, Sichuan 610054, China
2. Advanced Cryptography and System Security Key Laboratory of Sichuan Province (Chengdu University of Information Technology), Chengdu, Sichuan 610054, China
3. SUGON Industrial Control and Security Center (Industrial Control and Security Branch, National Engineering Research Center for Advanced Microprocessor Technology), Chengdu, Sichuan 610225, China

Received: 2024-10-07 Revised: 2024-12-19 Accepted: 2024-12-20 Online: 2025-03-14 Published: 2025-10-10
Contact: Yan CHANG (Email: cyttkl@cuit.edu.cn)
About the authors:
CAO Zeyi, born in 1998 in Chengdu, Sichuan, M. S. candidate. His research interests include data mining, data fusion, knowledge graph, and entity alignment.
CHANG Yan, born in 1979 in Alxa, Inner Mongolia, Ph. D., professor, CCF member. Her research interests include quantum computing, information security, and blockchain.
LAI Renxin, born in 1998 in Deyang, Sichuan, M. S. candidate. His research interests include data mining.
ZHANG Shibin, born in 1971 in Chongqing, Ph. D., professor. His research interests include quantum computing, information security, and blockchain.
QIN Zhi, born in 1977 in Ziyang, Sichuan, M. S., associate professor. His research interests include network and information security, blockchain, and internet of things.
YAN Lili, born in 1980 in Chengdu, Sichuan, Ph. D., professor. Her research interests include quantum computing and information security.
ZHANG Xuejian, born in 2000 in Tangshan, Hebei, M. S. candidate. His research interests include quantum computing and information security.
DI Yuanhao, born in 2000 in Yibin, Sichuan, M. S. candidate. His research interests include data mining.
Supported by:
National Key Research and Development Program of China (2022YFB3103103); National Natural Science Foundation of China (62272068, 62076042, 62102049); Sichuan Science and Technology Program (2023YFS0419); Key Research and Development Support Program of Chengdu (2021-YF09-00114-GX, 2019-YF005-02028-GX); Key Research and Development Program of Sichuan Province (2021YFSY0012, 2020YFG0307, 2021YFG0332)

Abstract

Data entities stored separately across large-scale institutions suffer from redundancy, missing information, and inconsistency, and must be integrated through entity alignment. Most existing entity alignment methods rely on the structural information of entities and align them through subgraph matching; however, decentralized stored data carries little structural information, so these methods align it poorly. To address this problem and to support the identification of important data, an attribute-based entity alignment model built on a single-layer graph neural network was proposed. Firstly, a single-layer graph neural network was used to avoid interference from second-order neighbor nodes. Secondly, an attribute weighting method based on information entropy was designed to distinguish the importance of attributes quickly in the initial stage. Finally, an attention-based encoder was constructed to represent the importance of different attributes in alignment from both local and global perspectives, giving a more comprehensive representation of entity information. Experimental results show that, on two decentralized storage datasets, the proposed model improves Hits@1 by 5.24 and 2.03 percentage points, respectively, over the second-best model, indicating better alignment performance than the other entity alignment methods compared.

Key words: important data identification, data fusion, information entropy, entity alignment, attention mechanism
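The abstract mentions an attribute weighting method based on information entropy but this page does not give its formula. The following is only a minimal sketch of one plausible reading, assuming that an attribute whose values vary more across entities (higher Shannon entropy) is more useful for telling entities apart and therefore gets a larger initial weight. The function names `attribute_entropy` and `entropy_weights`, the proportional normalization, and the sample records are all illustrative assumptions, not the authors' method.

```python
import math
from collections import Counter


def attribute_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of one attribute's values."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_weights(records):
    """Return a weight per attribute, proportional to that attribute's value entropy.

    `records` is a list of dicts mapping attribute name -> value.
    Assumption: weights are normalized to sum to 1; if every attribute has
    zero entropy, uniform weights are returned instead.
    """
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    entropies = {name: attribute_entropy(vals) for name, vals in columns.items()}
    total = sum(entropies.values())
    if total == 0:
        return {name: 1 / len(entropies) for name in entropies}
    return {name: h / total for name, h in entropies.items()}


# Hypothetical records: "name" is unique per entity, "city" takes two values,
# and "country" is constant, so it cannot help distinguish entities.
records = [
    {"name": "Zhang San", "city": "Chengdu", "country": "China"},
    {"name": "Li Si", "city": "Chengdu", "country": "China"},
    {"name": "Wang Wu", "city": "Deyang", "country": "China"},
    {"name": "Zhao Liu", "city": "Deyang", "country": "China"},
]
weights = entropy_weights(records)
```

In this toy input the entropies are 2 bits for `name`, 1 bit for `city`, and 0 for `country`, so the weights come out as 2/3, 1/3, and 0. Other entropy-based schemes normalize differently; the proportional form here was chosen only for brevity.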

CLC Number: TP391.1

Cite this article:

Zeyi CAO, Yan CHANG, Renxin LAI, Shibin ZHANG, Zhi QIN, Lili YAN, Xuejian ZHANG, Yuanhao DI. Attribute-based entity alignment algorithm for decentralized data storage in large-scale institutions[J]. Journal of Computer Applications, 2025, 45(10): 3195-3202.

Figures/Tables (12)

Fig. 1 Schematic diagram of distributed data storage

Fig. 2 Schematic diagram of entity alignment

Fig. 3 Schematic diagram of model principle

Fig. 4 Schematic diagram of attention in single-layer graph neural network

Fig. 5 Schematic diagrams of attribute sets of entity E1 and entity E2

Fig. 6 Fusion attention mechanism encoder

Tab. 1 Detailed information of CTD

Total attribute types	Entity seed pairs

Tab. 2 Detailed information of WIKI_MOVIE and IMDB_MOVIE datasets

Dataset	Entities	Attribute kinds	Attribute triples	Total attribute types	Entity seed pairs

Tab. 3 Comparison results of different models on CTD

Model	Hits@1/%	Hits@10/%	MRR	MR
BERT-INT	29.18	31.43	0.228	4.39
MTransE	25.20	33.12	0.318	3.14
GCN-Align	31.23	41.20	0.332	3.01
PipEA	74.62	81.12	0.625	1.61
RDGCN	75.23	80.10	0.582	1.72
AttrGNN	80.24	94.56	0.872	1.15
AutoAlign	85.21	96.43	0.901	1.10
Proposed model	90.45	99.60	0.937	1.07
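Tab. 3 reports Hits@1, Hits@10, MRR (mean reciprocal rank), and MR (mean rank). This page does not state how the authors computed these figures, so the sketch below only shows the usual textbook definitions, assuming each test entity has a 1-based rank for its true counterpart in the candidate list. The function names and the sample `ranks` list are illustrative and are not taken from the paper.

```python
def hits_at_k(ranks, k):
    """Percentage of test entities whose true counterpart is ranked within the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test entities (higher is better, at most 1)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def mean_rank(ranks):
    """Average rank of the true counterpart (lower is better, at least 1)."""
    return sum(ranks) / len(ranks)


# Hypothetical ranks for five test entities (1 means the correct match came first).
ranks = [1, 1, 2, 1, 12]

h1 = hits_at_k(ranks, 1)           # 3 of 5 ranked first
h10 = hits_at_k(ranks, 10)         # 4 of 5 ranked within the top 10
mrr = mean_reciprocal_rank(ranks)  # (1 + 1 + 1/2 + 1 + 1/12) / 5
mr = mean_rank(ranks)              # 17 / 5
```

Under these definitions Hits@k is a percentage while MRR is a value between 0 and 1, which matches the units used in the table headers.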

Tab. 4 Comparison results of different models on WIKI_MOVIE-IMDB_MOVIE dataset

Model	Hits@1/%	Hits@10/%	MRR
TransE	96.36	97.50	0.983
TranSparse	95.72	97.00	—
MultiKE	95.25	96.50	—
SEEA	96.42	98.00	—
Proposed model	98.45	99.76	0.997

Tab. 5 Comparison results of ablation experiments

Group	Hits@1/% (with init_weight)	Hits@10/% (with init_weight)	MRR (with init_weight)	Hits@1/% (without init_weight)	Hits@10/% (without init_weight)	MRR (without init_weight)
1	89.80	99.80	0.930	87.80	99.50	0.920
2	91.00	99.75	0.936	88.99	99.50	0.927
3	90.55	99.88	0.935	89.52	99.58	0.930
4	91.20	99.90	0.938	89.80	99.65	0.930
5	91.50	99.92	0.939	90.00	99.60	0.932
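As a reading aid for Tab. 5, the short sketch below computes the per-group and average Hits@1 difference between runs with and without init_weight. The numbers are copied from the table; the variable names are illustrative, and the average is a simple arithmetic summary rather than a figure reported by the authors.

```python
# Hits@1 (%) for the five ablation groups, copied from Tab. 5.
with_init = [89.80, 91.00, 90.55, 91.20, 91.50]
without_init = [87.80, 88.99, 89.52, 89.80, 90.00]

# Per-group gain in percentage points, rounded to the table's two decimals.
gains = [round(w - wo, 2) for w, wo in zip(with_init, without_init)]

# Unweighted average gain across the five groups.
mean_gain = sum(gains) / len(gains)
```

The per-group gains are 2.00, 2.01, 1.03, 1.40, and 1.50 percentage points, for an average of about 1.59 percentage points in favor of using init_weight.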

Fig. 7 Comparison of Hits@1, Hits@10, and MRR during training for the five groups of ablation experiments

