Journal of Computer Applications, 2024, Vol. 44, Issue (4): 1035-1048. DOI: 10.11772/j.issn.1001-9081.2023040537

• Artificial intelligence •

Survey of extractive text summarization based on unsupervised learning and supervised learning

Xiawuji1,2, Heming HUANG1,2, Gengzangcuomao1,2, Yutao FAN1,2

  1. College of Computer, Qinghai Normal University, Xining Qinghai 810008, China
  2. State Key Laboratory of Tibetan Intelligent Information Processing and Application (Qinghai Normal University), Xining Qinghai 810008, China
  • Received: 2023-05-06 Revised: 2023-07-19 Accepted: 2023-07-25 Online: 2023-12-04 Published: 2024-04-10
  • Contact: Heming HUANG
  • About the authors: Xiawuji, born in 1982, Ph.D. candidate, associate professor. Her research interests include pattern recognition and intelligent systems, and Tibetan intelligent information processing.
    Heming HUANG, born in 1969, Ph.D., professor. His research interests include pattern recognition and artificial intelligence.
    Gengzangcuomao, born in 1993, Ph.D. candidate. Her research interests include pattern recognition and intelligent systems.
    Yutao FAN, born in 1977, Ph.D. candidate, associate professor. Her research interests include pattern recognition and intelligent systems.
  • Supported by:
    National Natural Science Foundation of China (62066039); Qinghai Provincial Natural Science Foundation (2022-ZJ-925); Independent Project of State Key Laboratory of Tibetan Intelligent Information Processing and Application (2022-SKL-007)

A survey of extractive text summarization based on unsupervised and supervised learning

Xiawuji1,2, Heming HUANG1,2, Gengzangcuomao1,2, Yutao FAN1,2

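  1. College of Computer, Qinghai Normal University, Xining 810008
  2. State Key Laboratory of Tibetan Intelligent Information Processing and Application (Qinghai Normal University), Xining 810008
  • Corresponding author: Heming HUANG
  • About the authors: Xiawuji (born 1982), female (Tibetan), from Jianzha, Qinghai, associate professor, Ph.D. candidate, CCF member. Main research interests: pattern recognition and intelligent systems, Tibetan intelligent information processing.
    Heming HUANG (born 1969), male (Tibetan), from Ledu, Qinghai, professor, doctoral supervisor, Ph.D., CCF member. Main research interests: pattern recognition and artificial intelligence. huanghm@qhnu.edu.cn
    Gengzangcuomao (born 1993), female (Tibetan), from Gonghe, Qinghai, Ph.D. candidate. Main research interests: pattern recognition and intelligent systems.
    Yutao FAN (born 1977), female, from Datong, Shanxi, associate professor, Ph.D. candidate. Main research interests: pattern recognition and intelligent systems.
  • Funding:
    National Natural Science Foundation of China (62066039); Qinghai Provincial Natural Science Foundation (2022-ZJ-925); Independent Project of State Key Laboratory of Tibetan Intelligent Information Processing and Application (2022-SKL-007)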

Abstract:

Compared with abstractive (generative) summarization methods, extractive summarization methods are easier to implement, produce more readable output, and are more widely used. Existing surveys of extractive summarization mostly review particular methods or application domains, and a systematic review covering multiple aspects and multiple languages is lacking. This survey therefore discusses what the text summarization task involves, systematically reviews the related literature, and analyzes extractive text summarization methods based on unsupervised and supervised learning along multiple dimensions. First, the development of text summarization techniques is reviewed and the different extractive methods are analyzed, including methods based on rules, Term Frequency-Inverse Document Frequency (TF-IDF), centrality, latent semantics, deep learning, graph ranking, feature engineering, and pre-trained models, and the advantages and disadvantages of the different algorithms are compared. Second, text summarization datasets in different languages and the mainstream evaluation metrics are introduced in detail, and methods evaluated on the same datasets are compared using different experimental metrics. Finally, the main problems and challenges in current research on extractive text summarization are discussed, and possible solutions and research trends are presented.

Key words: extractive summarization, unsupervised learning, supervised learning, dataset, evaluation metric

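Abstract:

Compared with abstractive summarization methods, extractive summarization methods are simple to implement, highly readable, and widely applicable. Existing surveys of extractive summarization analyze only one particular method or domain and lack a systematic review that covers multiple aspects and multiple languages. This paper therefore explores the meaning of the text summarization task and, by systematically organizing and distilling the existing literature, analyzes extractive text summarization techniques based on unsupervised and supervised learning along multiple dimensions. First, the development of text summarization techniques is reviewed and the different extractive methods are analyzed, mainly those based on rules, term frequency-inverse document frequency (TF-IDF), centrality, latent semantics, deep learning, graph ranking, feature engineering, and pre-training, and the differences between the methods are compared. Second, commonly used text summarization datasets in different languages and the mainstream evaluation metrics are introduced in detail, and methods evaluated on the same datasets are compared using different experimental metrics. Finally, the main problems and challenges in current research on extractive text summarization are identified, and concrete solutions and future trends are proposed.

Key words: extractive summarization, unsupervised learning, supervised learning, dataset, evaluation metric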

CLC Number: