Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (1): 206-216. DOI: 10.11772/j.issn.1001-9081.2023091260

• Data Science and Technology •

Efficient similar exercise retrieval model based on unsupervised semantic hashing

Wei TONG1, Liyang HE2,3, Rui LI2,3, Wei HUANG1, Zhenya HUANG2,3, Qi LIU2,3

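  1. National Education Examinations Authority, Beijing 100084, China
  2. School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China
  3. State Key Laboratory of Cognitive Intelligence, Hefei 230088, China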
  • Received: 2023-09-14  Revised: 2023-10-14  Accepted: 2023-10-24  Online: 2023-12-08  Published: 2024-01-10
  • Corresponding author: Qi LIU
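  • About the authors: TONG Wei, born in 1984 in Cangzhou, Hebei, Ph. D. His research interests include education data mining and natural language processing.
    HE Liyang, born in 1998 in Chenzhou, Hunan, Ph. D. candidate, CCF member. His research interests include information retrieval.
    LI Rui, born in 2000 in Lu'an, Anhui, M. S. candidate, CCF member. His research interests include information retrieval.
    HUANG Wei, born in 1995 in Wenzhou, Zhejiang, Ph. D., CCF member. His research interests include education data mining and hierarchical multi-label classification.
    HUANG Zhenya, born in 1992 in Hefei, Anhui, Ph. D., associate professor, CCF member. His research interests include data mining, text mining, knowledge reasoning, and education big data analysis.
    First contact: LIU Qi, born in 1986 in Linyi, Shandong, Ph. D., professor, CCF member. His research interests include data mining, smart education, recommender systems, and social network analysis.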
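  • Supported by:
    Research Planning Project of National Education Examinations (GJK2021009); National Key Research and Development Program of China (2021YFF0901003); National Natural Science Foundation of China (62106244); University Synergy Innovation Program of Anhui Province (GXXT-2022-042)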

Efficient similar exercise retrieval model based on unsupervised semantic hashing

Wei TONG1, Liyang HE2,3, Rui LI2,3, Wei HUANG1, Zhenya HUANG2,3, Qi LIU2,3

  1. National Education Examinations Authority, Beijing 100084, China
  2. School of Computer Science and Technology, University of Science and Technology of China, Hefei Anhui 230027, China
  3. State Key Laboratory of Cognitive Intelligence, Hefei Anhui 230088, China
  • Received: 2023-09-14  Revised: 2023-10-14  Accepted: 2023-10-24  Online: 2023-12-08  Published: 2024-01-10
  • Contact: Qi LIU
  • About author:TONG Wei, born in 1984, Ph. D. His research interests include education data mining, natural language processing.
    HE Liyang, born in 1998, Ph. D. candidate. His research interests include information retrieval.
    LI Rui, born in 2000, M. S. candidate. His research interests include information retrieval.
    HUANG Wei, born in 1995, Ph. D. His research interests include education data mining, hierarchical multi-label categorization.
    HUANG Zhenya, born in 1992, Ph. D., associate professor. His research interests include data mining, text mining, knowledge reasoning, education big data analysis.
  • Supported by:
    National Education Examinations Authority (GJK2021009); National Key Research and Development Program of China (2021YFF0901003); National Natural Science Foundation of China (62106244); University Synergy Innovation Program of Anhui Province (GXXT-2022-042)

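Abstract:

Similar exercise retrieval aims to find, in a database, exercises whose assessment objectives are similar to those of a given query exercise. With the continued growth of online education, exercise databases are becoming ever larger, and the specialized nature of exercise data makes relevance annotation very difficult; an efficient, annotation-free similar exercise retrieval model is therefore needed. Unsupervised semantic hashing can map high-dimensional data to low-dimensional, efficient binary representations without supervision signals. However, a semantic hashing model cannot simply be applied to similar exercise retrieval, because exercise data carries rich semantic information while the representation space of binary vectors is limited. To address this, a similar exercise retrieval model that acquires and retains key information is proposed. First, a key information acquisition module is designed to extract the key information of exercise data, and a de-redundancy objective loss is introduced to remove redundant information. Second, a time-varying activation function is introduced in the encoding process to reduce the loss of encoded information. Third, to make maximal use of the Hamming space, a bit balance objective and a bit independence objective are introduced during optimization to optimize the distribution of the binary representations. Experimental results on the MATH and HISTORY datasets show that, compared with the best-performing text semantic hashing model DHIM (Deep Hash InfoMax), the proposed model achieves average improvements of about 54% and 23%, respectively, over three recall settings on the two datasets; in terms of retrieval efficiency, the proposed model has a clear advantage over the best similar exercise retrieval model, QuesCo.

Keywords: similar exercise retrieval, unsupervised semantic hashing, representation learning, contrastive learning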

Abstract:

Similar exercise retrieval aims to retrieve, from an exercise database, exercises whose testing goals are similar to those of a given query exercise. As online education evolves, exercise databases keep growing in size, and the specialized nature of exercises makes their relevance hard to annotate. Online education systems therefore require an efficient, unsupervised model for finding similar exercises. Unsupervised semantic hashing can map high-dimensional data to compact and efficient binary representations without supervision signals. However, it is inadequate to simply apply a semantic hashing model to similar exercise retrieval, because exercise data contains rich semantic information while the representation space of binary vectors is limited. To address this issue, a similar exercise retrieval model that acquires and retains crucial information was proposed. Firstly, a crucial information acquisition module was designed to acquire critical information from exercise data, and a de-redundancy objective loss was introduced to eliminate redundant information. Secondly, a time-varying activation function was introduced in the encoding process to reduce the loss of encoded information. Thirdly, to maximize the utilization of the Hamming space, a bit balance loss and a bit independence loss were introduced in the optimization process to optimize the distribution of the binary representations. Experimental results on the MATH and HISTORY datasets demonstrate that the proposed model outperforms the state-of-the-art text semantic hashing model Deep Hash InfoMax (DHIM), with average improvements of approximately 54% and 23%, respectively, over three recall settings on the two datasets. Moreover, compared with the best-performing similar exercise retrieval model QuesCo, the proposed model demonstrates a clear advantage in retrieval efficiency.

Key words: similar exercise retrieval, unsupervised semantic hashing, representation learning, contrastive learning
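
The abstract names three ingredients without giving formulas: a time-varying activation, a bit balance objective and a bit independence objective, followed by retrieval in Hamming space. The sketch below illustrates the general idea only and is not the authors' implementation. Every function name is hypothetical, the scaled tanh is an assumed stand-in for the paper's activation (a common way to approach the sign function gradually during training), and the two losses use textbook forms: the squared per-bit mean and the squared off-diagonal bit correlation.

```python
import math


def scaled_tanh(x, beta):
    """Assumed time-varying activation: tanh(beta * x) approaches sign(x)
    as beta is increased over the course of training."""
    return math.tanh(beta * x)


def bit_balance_loss(codes):
    """Mean squared per-bit average over a batch of +1/-1 codes.
    It is zero when every bit is +1 for exactly half of the items."""
    n, k = len(codes), len(codes[0])
    means = [sum(row[j] for row in codes) / n for j in range(k)]
    return sum(m * m for m in means) / k


def bit_independence_loss(codes):
    """Mean squared off-diagonal entry of the bit correlation matrix
    (1/n) * B^T B. It is zero when distinct bits are uncorrelated."""
    n, k = len(codes), len(codes[0])
    if k < 2:
        return 0.0
    total = 0.0
    for a in range(k):
        for b in range(k):
            if a != b:
                corr = sum(row[a] * row[b] for row in codes) / n
                total += corr * corr
    return total / (k * (k - 1))


def pack_bits(code):
    """Pack a +1/-1 code into an integer, first bit most significant."""
    value = 0
    for bit in code:
        value = (value << 1) | (1 if bit > 0 else 0)
    return value


def hamming_distance(a, b):
    """Number of differing bits between two packed codes (XOR + popcount)."""
    return bin(a ^ b).count("1")


def retrieve(query, database, top_k):
    """Indices of the top_k packed codes nearest to the query;
    ties keep database order because sorted() is stable."""
    order = sorted(range(len(database)),
                   key=lambda i: hamming_distance(query, database[i]))
    return order[:top_k]


if __name__ == "__main__":
    # Toy batch: four 2-bit codes that cover the Hamming space evenly.
    batch = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
    print(bit_balance_loss(batch), bit_independence_loss(batch))
    packed = [pack_bits(c) for c in batch]
    print(retrieve(pack_bits([1, 1]), packed, top_k=2))
```

Both losses vanish on a batch that spreads its codes evenly over the Hamming space and grow when codes collapse onto a few patterns, which is the sense in which such objectives make fuller use of a short binary code. Packing codes into integers is what makes the distance computation cheap compared with dense-vector similarity.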

CLC number: