《计算机应用》唯一官方网站 ›› 2023, Vol. 43 ›› Issue (8): 2431-2438.DOI: 10.11772/j.issn.1001-9081.2022071108
所属专题: 数据科学与技术
杨婷, 莫若玉, 张秀娟, 朱洲森
收稿日期:2022-07-29
									
				
											修回日期:2022-09-19
									
				
											接受日期:2022-09-19
									
				
											发布日期:2023-01-15
									
				
											出版日期:2023-08-10
									
				
			通讯作者:
					朱洲森
							作者简介:杨婷(1997—),女,四川广元人,硕士研究生,主要研究方向:智能信息处理基金资助:Ting YANG, Ruoyu MO, Xiujuan ZHANG, Zhousen ZHU
Received:2022-07-29
									
				
											Revised:2022-09-19
									
				
											Accepted:2022-09-19
									
				
											Online:2023-01-15
									
				
											Published:2023-08-10
									
			Contact:
					Zhousen ZHU   
							About author:YANG Ting, born in 1997, M. S. candidate. Her research interests include intelligent information processing.Supported by:摘要:
针对关系型数据库(RDB)现有的全文搜索方案存在的效率低下、资源占用高的问题,提出一种具有增强式辅助缓存的轻量级关系型数据库全文搜索模型。首先,该模型构建基于Redis的倒排索引,并利用缓存索引缩小搜索范围,从而用内存高效的数据处理能力解决关系型数据库I/O瓶颈,并提升系统整体性能;其次,为保证搜索结果的准确性和时效性,进一步提出索引同步策略,而且设计并实现了增量索引组件来隐藏索引处理细节,从而提高模型的易用性和通用性;最后,对于热点数据提供一种基于访问热度的索引更新机制,以降低倒排索引的内存占用。实验结果表明,所提模型在保证关系型数据库全文搜索响应速度和准确度的前提下,空间资源消耗比MySQL全文索引降低了48.8%~60.9%,比Elasticsearch降低了85.2%~96.2%,证明所提模型在实际应用中可行且有效。
中图分类号:
杨婷, 莫若玉, 张秀娟, 朱洲森. 轻量级缓存策略的关系型数据库全文搜索加强与扩展[J]. 计算机应用, 2023, 43(8): 2431-2438.
Ting YANG, Ruoyu MO, Xiujuan ZHANG, Zhousen ZHU. Enhancement and expansion of full-text search in relational databases based on lightweight caching strategy[J]. Journal of Computer Applications, 2023, 43(8): 2431-2438.
| id | stem | answer | 
|---|---|---|
| 1 | Redis支持的数据类型 | String,Hash,List,Set和Zset | 
| 2 | 消息队列在存取消息时要满足的3个需求 | 消息保序、处理重复的消息和保证消息可靠性 | 
| 3 | Redis中Hash的实现 | Hash类型的底层数据结构是由压缩列表或哈希表实现的 | 
表1 question表数据
Tab. 1 Data from table question
| id | stem | answer | 
|---|---|---|
| 1 | Redis支持的数据类型 | String,Hash,List,Set和Zset | 
| 2 | 消息队列在存取消息时要满足的3个需求 | 消息保序、处理重复的消息和保证消息可靠性 | 
| 3 | Redis中Hash的实现 | Hash类型的底层数据结构是由压缩列表或哈希表实现的 | 
| 词元 | 倒排列表 | 
|---|---|
| Redis | 〈question: stem;{1,3}〉 | 
| 支持 | 〈question: stem;{1}〉 | 
| 数据 | 〈question: stem;{1}〉,〈question: answer;{1}〉 | 
| 类型 | 〈question: stem;{1}〉,〈question: answer;{3}〉 | 
| Hash | 〈question: stem;{1}〉,〈question: answer;{1,3}〉 | 
表2 部分索引数据结构
Tab. 2 Partial index data structure
| 词元 | 倒排列表 | 
|---|---|
| Redis | 〈question: stem;{1,3}〉 | 
| 支持 | 〈question: stem;{1}〉 | 
| 数据 | 〈question: stem;{1}〉,〈question: answer;{1}〉 | 
| 类型 | 〈question: stem;{1}〉,〈question: answer;{3}〉 | 
| Hash | 〈question: stem;{1}〉,〈question: answer;{1,3}〉 | 
| 方法 | 自定义 词典 | 数字 识别 | 人名 识别 | 机构名 识别 | 新词 发现 | 
|---|---|---|---|---|---|
| BaseAnalysis | × | √ | × | × | × | 
| ToAnalysis | √ | √ | √ | × | × | 
| NlpAnalysis | √ | √ | √ | √ | √ | 
| IndexAnalysis | √ | √ | √ | × | × | 
表3 Ansj功能统计
Tab. 3 Ansj function statistics
| 方法 | 自定义 词典 | 数字 识别 | 人名 识别 | 机构名 识别 | 新词 发现 | 
|---|---|---|---|---|---|
| BaseAnalysis | × | √ | × | × | × | 
| ToAnalysis | √ | √ | √ | × | × | 
| NlpAnalysis | √ | √ | √ | √ | √ | 
| IndexAnalysis | √ | √ | √ | × | × | 
| 属性 | 说明 | 
|---|---|
| sql | 原查询语句 | 
| sCol | 查询字段 | 
| Condition | 条件表达式(字段、条件运算符和条件值) | 
| fTable | 表名 | 
表4 ParseQuery类部分属性(部分)
Tab. 4 Properties of ParseQuery (part)
| 属性 | 说明 | 
|---|---|
| sql | 原查询语句 | 
| sCol | 查询字段 | 
| Condition | 条件表达式(字段、条件运算符和条件值) | 
| fTable | 表名 | 
| 属性 | 数据类型 | 说明 | 
|---|---|---|
| sqlType | String | 操作类型 | 
| tbName | String | 表单 | 
| cols | String[] | 列名集合 | 
| colValues | String[] | 列值集合 | 
| condition | Condition | 更新条件 | 
表5 ParseUpdate类属性(部分)
Tab. 5 Properties of ParseUpdate (part)
| 属性 | 数据类型 | 说明 | 
|---|---|---|
| sqlType | String | 操作类型 | 
| tbName | String | 表单 | 
| cols | String[] | 列名集合 | 
| colValues | String[] | 列值集合 | 
| condition | Condition | 更新条件 | 
| 数据 条数/104 | 数据 大小/MB | FullText索引 | Elasticsearch | 缓存模式 | |||
|---|---|---|---|---|---|---|---|
| 大小/MB | 耗时/s | 大小/MB | 耗时/s | 大小/MB | 耗时/s | ||
| 1 | 100.67 | 2.15 | 1.6 | 28.8 | 53 | 1.10 | 1.4 | 
| 50 | 625.00 | 12.54 | 3.6 | 105.1 | 290 | 4.90 | 24.0 | 
| 100 | 1 996.80 | 22.56 | 7.8 | 140.8 | 480 | 9.46 | 39.0 | 
| 500 | 4 853.76 | 101.68 | 29.0 | 327.6 | 1 852 | 48.40 | 155.0 | 
表6 索引全量构建时空开销对比
Tab. 6 Comparison of space and time costs of full-index construction
| 数据 条数/104 | 数据 大小/MB | FullText索引 | Elasticsearch | 缓存模式 | |||
|---|---|---|---|---|---|---|---|
| 大小/MB | 耗时/s | 大小/MB | 耗时/s | 大小/MB | 耗时/s | ||
| 1 | 100.67 | 2.15 | 1.6 | 28.8 | 53 | 1.10 | 1.4 | 
| 50 | 625.00 | 12.54 | 3.6 | 105.1 | 290 | 4.90 | 24.0 | 
| 100 | 1 996.80 | 22.56 | 7.8 | 140.8 | 480 | 9.46 | 39.0 | 
| 500 | 4 853.76 | 101.68 | 29.0 | 327.6 | 1 852 | 48.40 | 155.0 | 
| 数据条数/104 | Like匹配 | FullText索引 | Elasticsearch | 缓存模式 | 
|---|---|---|---|---|
| 10 | 382 | 41 | 18 | 24 | 
| 50 | 2 146 | 1 094 | 101 | 360 | 
| 100 | 4 140 | 2 055 | 718 | 1 215 | 
表7 数据检索响应时间对比
Tab. 7 Comparison of data retrieval response time
| 数据条数/104 | Like匹配 | FullText索引 | Elasticsearch | 缓存模式 | 
|---|---|---|---|---|
| 10 | 382 | 41 | 18 | 24 | 
| 50 | 2 146 | 1 094 | 101 | 360 | 
| 100 | 4 140 | 2 055 | 718 | 1 215 | 
| 查询文本 | Like匹配 | FullText | Elasticsearch | 缓存模式 | 
|---|---|---|---|---|
| 通信电缆 | 0 | 0 | 7 001 | 7 001 | 
| 船舶电气钳工 | 1 000 | 1 000 | 13 000 | 13 000 | 
| 飞机测量 | 0 | 0 | 11 000 | 11 000 | 
| 电信交换 | 1 000 | 1 000 | 2 000 | 2 000 | 
表8 多关键词检索命中数对比
Tab. 8 Comparison of hits of multi-keyword retrieval
| 查询文本 | Like匹配 | FullText | Elasticsearch | 缓存模式 | 
|---|---|---|---|---|
| 通信电缆 | 0 | 0 | 7 001 | 7 001 | 
| 船舶电气钳工 | 1 000 | 1 000 | 13 000 | 13 000 | 
| 飞机测量 | 0 | 0 | 11 000 | 11 000 | 
| 电信交换 | 1 000 | 1 000 | 2 000 | 2 000 | 
| 1 | FERNÁNDEZ CANDEL C J, RUIZ D S, GARCÍA-MOLINA J J. A unified metamodel for NoSQL and relational databases[J]. Information Systems, 2022, 104: No.101898. 10.1016/j.is.2021.101898 | 
| 2 | YILMAZEL O, YUREKLI B, YILMAZEL B, et al. Relational databases versus information retrieval systems: a case study[C/OL]// Proceedings of the 2009 IADIS International Conference Applied Computing [2022-05-22].. | 
| 3 | 鲁佳文,严丽. 对象关系数据库到RDF(S)的映射方法[J]. 计算机科学, 2021, 48(10):145-151. 10.11896/jsjkx.200800006 | 
| LU J W, YAN L. Mapping method from object-relational database to RDF(S)[J]. Computer Science, 2021, 48(10): 145-151. 10.11896/jsjkx.200800006 | |
| 4 | ČEREŠŇÁK R, KVET M. Comparison of query performance in relational a non-relation databases[J]. Transportation Research Procedia, 2019, 40: 170-177. 10.1016/j.trpro.2019.07.027 | 
| 5 | JIN D W, CHEN G, HAO W N, et al. Whole database retrieval method of general relational database based on Lucene[C]// Proceedings of the 2020 IEEE International Conference on Artificial Intelligence and Computer Applications. Piscataway: IEEE, 2020: 1277-1279. 10.1109/icaica50127.2020.9182496 | 
| 6 | GYŐRÖDI C A, DUMŞE-BURESCU D V, ZMARANDA D R, et al. Performance analysis of NoSQL and relational databases with CouchDB and MySQL for application’s data storage[J]. Applied Sciences, 2020, 10(23): No.8524. 10.3390/app10238524 | 
| 7 | 吴小青. 嵌入式数据库(MySQL)中数据模糊检索方法解析[J]. 软件, 2020, 41(9):115-118. 10.3969/j.issn.1003-6970.2020.09.032 | 
| WU X Q. Analysis of data fuzzy retrieval methods in embedded database (MySQL)[J]. Software, 2020, 41(9): 115-118. 10.3969/j.issn.1003-6970.2020.09.032 | |
| 8 | KELEC A, DUJLOVIC I, OBRADOVIC N. One approach for full-text search of files in MongoDB based systems[C]// Proceedings of the 18th International Symposium INFOTEH-JAHORINA. Piscataway: IEEE, 2019: 1-6. 10.1109/infoteh.2019.8717777 | 
| 9 | CHAITANYA B S S K, REDDY D A K, CHANDRA B P S E, et al. Full-text search using database index[C]// Proceedings of the 5th International Conference on Computing, Communication, Control and Automation. Piscataway: IEEE, 2019: 1-5. 10.1109/iccubea47591.2019.9128683 | 
| 10 | 吴东明,王丽娟. 基于MySQL的中文全文搜索研究[J]. 电脑知识与技术, 2017, 13(32):7-8. | 
| WU D M, WANG L J. Research on Chinese full text search based on MySQL[J]. Computer Knowledge and Technology, 2017, 13(32): 7-8. | |
| 11 | SHAH N, WILLICK D, MAGO V. A framework for social media data analytics using Elasticsearch and Kibana[J]. Wireless Networks, 2022, 28(3): 1179-1187. 10.1007/s11276-018-01896-2 | 
| 12 | MARINŠEK B. Analiza integracije iskalnega mehanizma Elasticsearch na primeru sistema eArhiv Pošte Slovenije[D]. Maribor : University of Maribor, 2018. | 
| 13 | TAWARE U, SHAIKH N. Heterogeneous database system for faster data querying using Elasticsearch[C]// Proceedings of the 4th International Conference on Computing Communication Control and Automation. Piscataway: IEEE, 2018: 1-4. 10.1109/iccubea.2018.8697437 | 
| 14 | ZMARANDA D R, MOISI C I, GYŐRÖDI C A, et al. An analysis of the performance and configuration features of MySQL document store and Elasticsearch as an alternative backend in a data replication solution[J]. Applied Sciences, 2021, 11(24): No.11590. 10.3390/app112411590 | 
| 15 | 刘泽恩. 基于Elasticsearch的垂直搜索推荐系统研究[D]. 武汉:湖北大学, 2021:56-58. 10.47056/0365-9615-2022-174-9-361-368 | 
| LIU Z E. Research of vertical search recommendation system based on Elasticsearch[D]. Wuhan: Hubei University, 2021:56-58. 10.47056/0365-9615-2022-174-9-361-368 | |
| 16 | 王伟,魏乐,刘文清,等. 基于ElasticSearch的分布式全文搜索系统[J]. 电子科技, 2018, 31(8):56-59, 65. | 
| WANG W, WEI L, LIU W Q, et al. Design and implementation on distributed full-text search system based on ElasticSearch[J]. Electronic Science and Technology, 2018, 31(8): 56-59, 65. | |
| 17 | 梁爽,赵宝军,张海霞. 基于Elasticsearch的海量数据入库及快速检索方法研究[J]. 测绘与空间地理信息, 2020, 43(12):74-76. 10.3969/j.issn.1672-5867.2020.12.019 | 
| LIANG S, ZHAO B J, ZHANG H X. Research on storage and quick retrieval of massive data based on Elasticsearch[J]. Geomatics & Spatial Information Technology, 2020, 43(12): 74-76. 10.3969/j.issn.1672-5867.2020.12.019 | |
| 18 | BHANDARKAR S, BN N. A full-text-based search algorithm vs Elasticsearch[J]. Studies in Indian Place Names, UGC Care Journal, 2020, 40(74): 2168-2171. | 
| 19 | 毛海宇,舒继武,李飞,等. 内存计算研究进展[J]. 中国科学:信息科学, 2021, 51(2):173-205. 10.1360/ssi-2020-0037 | 
| MAO H Y, SHU J W, LI F, et al. Development of progressing-in-memory[J]. SCIENTIA SINICA Informationis, 2021, 51(2):173-205. 10.1360/ssi-2020-0037 | |
| 20 | 葛微,罗圣美,周文辉,等. HiBase:一种基于分层式索引的高效HBase查询技术与系统[J]. 计算机学报, 2016, 39(1):140-153. 10.11897/SP.J.1016.2016.00140 | 
| GE W, LUO S M, ZHOU W H, et al. HiBase: a hierarchical indexing mechanism and system for efficient HBase query[J]. Chinese Journal of Computers, 2016, 39(1):140-153. 10.11897/SP.J.1016.2016.00140 | |
| 21 | ZULFA M I, FADLI A, WARDHANA A W. Application caching strategy based on in-memory using Redis server to accelerate relational data access[J]. Jurnal Teknologi dan Sistem Komputer, 2020, 8(2): 157-163. 10.14710/jtsiskom.8.2.2020.157-163 | 
| 22 | SU Q K, GAO X, ZHANG X, et al. A novel cache strategy leveraging Redis with filters to speed up queries[C]// Proceedings of the SPIE 12162, International Conference on High Performance Computing and Communication. Bellingham, WA: SPIE, 2022: No.121620O. 10.1117/12.2628119 | 
| 23 | EDDELBUETTEL D. A brief introduction to Redis[EB/OL]. (2022-03-13) [2022-04-10].. | 
| 24 | MATALLAH H, BELALEM G, BOUAMRANE K. Evaluation of NoSQL databases: MongoDB, Cassandra, HBase, Redis, Couchbase, OrientDB[J]. International Journal of Software Science and Computational Intelligence, 2020, 12(4): 71-91. 10.4018/ijssci.2020100105 | 
| 25 | CHEN X, WANG F J, XU J M, et al. A distributed cache system based on Redis for high-speed railway catenary monitoring system[C]// Proceedings of the 2020 Chinese Automation Congress. Piscataway: IEEE, 2020: 2048-2053. 10.1109/cac51589.2020.9326531 | 
| 26 | 黄立冬. 分布式搜索引擎中关键词倒排索引方法仿真[J]. 计算机仿真, 2019, 36(8):380-383, 398. 10.3969/j.issn.1006-9348.2019.08.079 | 
| HUANG L D. Simulation of keyword inverted index method in distributed search engine[J]. Computer Simulation, 2019, 36(8): 380-383, 398. 10.3969/j.issn.1006-9348.2019.08.079 | |
| 27 | 王莱,周腾达,王正飞,等. 基于布隆过滤器和B+树构建倒排索引的电子病历密文搜索[J]. 计算机应用与软件, 2021, 38(4):276-280, 333. | 
| WANG L, ZHOU T D, WANG Z F, et al. Search on encrypted electronic medical records using inverted index based on bloom filter and B+tree[J]. Computer Applications and Software, 2021, 38(4): 276-280, 333. | |
| 28 | 李子骅. Redis入门指南[M]. 北京:人民邮电出版社, 2013:19-57. | 
| LI Z H. Redis Getting Started Guide[M]. Beijing: Posts & Telecom Press, 2013:19-57. | |
| 29 | 霍晨鹏. 科技专家遴选系统关键技术研究与实现[D]. 广州:华南理工大学, 2020:9-10. | 
| HUO C P. The research and implementation of key techniques in science-technology expert selection system[D]. Guangzhou: South China University of Technology, 2020:9-10. | |
| 30 | 邱涛,王斌,舒昭维,等. 面向关系数据库的智能索引调优方法[J]. 软件学报, 2020, 31(3):634-647. | 
| QIU T, WANG B, SHU Z W, et al. Intelligent index tuning approach for relational databases[J]. Journal of Software, 2020, 31(3): 634-647. | |
| 31 | 李燚,顾乃杰,黄增士,等. Redis集群可靠性的研究与优化[J]. 计算机工程, 2018, 44(5):40-46. | 
| LI Y, GU N J, HUANG Z S, et al. Research and optimization of Redis cluster reliability[J]. Computer Engineering, 2018, 44(5): 40-46. | |
| 32 | TIAN Q, LI J M, ZHAO S, et al. A cache consistency protocol with improved architecture[J]. International Journal of Performability Engineering, 2018, 14(1):178-185. | 
| 33 | ÖZSU M T, VALDURIEZ P. Principles of Distributed Database Systems[M]. 2nd ed. Upper Saddle River, NJ: Prentice Hall, 1999:247-248. | 
| 34 | MEGIDDO N, MODHA D S. Outperforming LRU with an adaptive replacement cache algorithm[J]. Computer, 2004, 37(4): 58-65. 10.1109/mc.2004.1297303 | 
| 35 | MELNIK S, GUBAREV A, LONG J J, et al. Dremel: interactive analysis of web-scale datasets[J]. Proceedings of the VLDB Endowment, 2010, 3(1/2):330-339. 10.14778/1920841.1920886 | 
| 36 | 瞿龙俊. 基于HBase的交通流数据实时存储与查询优化方案的设计与实现[D]. 镇江:江苏大学, 2017:32-33. | 
| QU L J. Design and implementation of HBase-based traffic stream data real-time storage and query optimization[D]. Zhenjiang: Jiangsu University, 2017:32-33. | 
| [1] | 姜文涛, 李宛宣, 张晟翀. 非线性时间一致性的相关滤波目标跟踪[J]. 《计算机应用》唯一官方网站, 2024, 44(8): 2558-2570. | 
| [2] | 唐朝君, 夏梅妍, 张华, 谢挺. 动态事件触发下多智能体系统固定时间一致性[J]. 《计算机应用》唯一官方网站, 2024, 44(3): 960-965. | 
| [3] | 张卓, 陈花竹. 基于一致性和多样性的多尺度自表示学习的深度子空间聚类[J]. 《计算机应用》唯一官方网站, 2024, 44(2): 353-359. | 
| [4] | 高婷婷, 姚中原, 贾淼, 斯雪明. 链上链下一致性保护技术综述[J]. 《计算机应用》唯一官方网站, 2024, 44(12): 3658-3668. | 
| [5] | 张帅华, 张淑芬, 周明川, 徐超, 陈学斌. 基于半监督联邦学习的恶意流量检测模型[J]. 《计算机应用》唯一官方网站, 2024, 44(11): 3487-3494. | 
| [6] | 刘秋杰, 万源, 吴杰. 深度双模态源域对称迁移学习的跨模态检索[J]. 《计算机应用》唯一官方网站, 2024, 44(1): 24-31. | 
| [7] | 甘舰文, 陈艳, 周芃, 杜亮. 基于高阶一致性学习的聚类集成算法[J]. 《计算机应用》唯一官方网站, 2023, 43(9): 2665-2672. | 
| [8] | 王瑞琪, 纪淑娟, 曹宁, 郭亚杰. 基于一致性训练的半监督虚假招聘广告检测模型[J]. 《计算机应用》唯一官方网站, 2023, 43(9): 2932-2939. | 
| [9] | 张钟元, 戴炜, 李光昱, 陈小庆, 邓启波. 基于改进人工势场和一致性协议的协同避障算法[J]. 《计算机应用》唯一官方网站, 2023, 43(8): 2644-2650. | 
| [10] | 王嘉欣, 刘成林. 具有反馈控制的多自主体系统迭代学习输出一致性[J]. 《计算机应用》唯一官方网站, 2023, 43(8): 2630-2635. | 
| [11] | 高安安, 胡爱花, 江正仙. 事件触发脉冲控制多智能体系统的安全一致[J]. 《计算机应用》唯一官方网站, 2023, 43(1): 140-146. | 
| [12] | 李懿恒, 杜晨曦, 杨燕燕, 李翔宇. 基于伪标签一致度的不平衡数据特征选择算法[J]. 《计算机应用》唯一官方网站, 2022, 42(2): 475-484. | 
| [13] | 吴悦, 雒江涛, 刘锐, 胡钟尹. 基于感知哈希和切块的视频相似度检测方法[J]. 计算机应用, 2021, 41(7): 2070-2075. | 
| [14] | 马子玉, 何明, 刘祖均, 顾凌枫, 刘锦涛. 无人机协同控制研究综述[J]. 计算机应用, 2021, 41(5): 1477-1483. | 
| [15] | 王云燕, 胡爱花. 网络攻击下双层结构多智能体系统一致性[J]. 计算机应用, 2021, 41(5): 1399-1405. | 
| 阅读次数 | ||||||
| 全文 |  | |||||
| 摘要 |  | |||||