Research of Open-Source Data Collection and Processing Techniques Enhanced by Large Language Models

doi:10.11772/j.issn.1001-9081.2025111455

Journal of Computer Applications

Received:2025-12-15 Revised:2026-02-27 Accepted:2026-03-16 Online:2026-04-06 Published:2026-04-06

增强大模型的开源数据采编技术方法

郑林,张俭鸽,贾子硕,王潇飞

School of Data and Target Engineering, Information Engineering University

Corresponding author: 郑林

Abstract

Multi-source, heterogeneous open-source data poses challenges in the big data era, where traditional methods struggle to deliver both generalizability and efficient processing. To address this, the paper proposes an open-source data collection and processing method built on an enhanced large language model (LLM). Prompt engineering and a self-building and self-matching knowledge base (KBSM) strengthen the LLM's information extraction capability, and the method integrates three stages: data acquisition, data organization, and data storage. It collects open-source data through multiple search engines, applies a three-tier matching deduplication strategy to achieve semantic-level deduplication, and optimizes the framework with multi-threading and coroutines. Experimental results show that the LLM enhanced with both prompt engineering and KBSM performs best at open-source data extraction, with a success rate of up to 88%. On the DuEE1.0 dataset, it also shows stronger argument detection than the other methods compared, with an argument recall of 91.41%. The three-tier matching deduplication strategy yields a 78.97% improvement over database deduplication, and multi-threading and coroutines raise deduplication efficiency by 46.04% while preserving data consistency. The proposed method effectively enhances the LLM's ability to extract information from open-source data, converting complex, hard-to-process open-source data into usable, high-quality, low-redundancy data while meeting the timeliness requirements of processing massive volumes of open-source data.

Key words: Data Collection and Compilation, Enhanced Large Language Model, KBSM, Semantic Deduplication, Coroutine
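The abstract names a three-tier matching deduplication strategy but does not define the tiers. The sketch below is only one plausible reading, under stated assumptions: tier 1 is exact matching on a hash of normalized text, tier 2 is near-duplicate matching by character n-gram Jaccard similarity, and tier 3 is a semantic comparison, approximated here by cosine similarity over word counts as a stand-in for whatever semantic model the paper uses. The function names and both thresholds are hypothetical, and the multi-threading and coroutine optimizations are not shown.

```python
import hashlib
import math
import re
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def shingles(text: str, n: int = 3) -> set[str]:
    """Character n-grams used by the (assumed) near-duplicate tier."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def jaccard(a: set[str], b: set[str]) -> float:
    """Set overlap in [0, 1]; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two word-count vectors (stand-in for embeddings)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def deduplicate(docs: list[str],
                near_threshold: float = 0.8,       # hypothetical value
                semantic_threshold: float = 0.8    # hypothetical value
                ) -> list[str]:
    """Keep the first occurrence of each document; drop later matches."""
    seen_hashes: set[str] = set()
    kept: list[tuple[set[str], Counter]] = []
    result: list[str] = []
    for doc in docs:
        norm = normalize(doc)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        # Tier 1 (assumed): exact match on normalized text.
        if digest in seen_hashes:
            continue
        sh = shingles(norm)
        vec = Counter(norm.split())
        # Tier 2 (assumed): near-duplicate by character n-gram overlap.
        if any(jaccard(sh, k_sh) >= near_threshold for k_sh, _ in kept):
            continue
        # Tier 3 (assumed): semantic match, approximated by word-count cosine.
        if any(cosine(vec, k_vec) >= semantic_threshold for _, k_vec in kept):
            continue
        seen_hashes.add(digest)
        kept.append((sh, vec))
        result.append(doc)
    return result
```

Ordering the tiers from cheapest to most expensive means most duplicates are rejected before the costlier comparison runs. In a real pipeline the third tier would presumably compare model-derived representations rather than word counts.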

CLC Number: TP391.1

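郑林, 张俭鸽, 贾子硕, 王潇飞. Research of Open-Source Data Collection and Processing Techniques Enhanced by Large Language Models[J]. Journal of Computer Applications, DOI: 10.11772/j.issn.1001-9081.2025111455.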