Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (12): 3551-3557. DOI: 10.11772/j.issn.1001-9081.2021050821

• The 18th China Conference on Machine Learning (CCML 2021) •

Vietnamese scene text detection based on modified Mask R-CNN

Yate FENG1, Yimin WEN2

  1. School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China
  2. Guangxi Key Laboratory of Image and Graphic Intelligent Processing (Guilin University of Electronic Technology), Guilin, Guangxi 541004, China
  • Received: 2021-05-12 Revised: 2021-05-20 Accepted: 2021-05-21 Online: 2021-12-28 Published: 2021-12-10
  • Contact: Yimin WEN
  • About author: FENG Yate, born in 1996 in Guilin, Guangxi, M. S. candidate. His research interests include machine learning and computer vision.
  • Supported by:
    the National Natural Science Foundation of China (61866007); the Natural Science Foundation of Guangxi (2018GXNSFDA138006); the Humanities and Social Sciences Research Project of the Ministry of Education (17JDGC022); the Guangxi Degree and Graduate Education Reform Project (JGY2017055)

Abstract:

To address the lack of training data for Vietnamese scene text detection and the incomplete detection of Vietnamese tone marks, a Vietnamese scene text detection algorithm based on a modified version of the instance segmentation network Mask R-CNN was proposed. To segment Vietnamese scene text with tone marks accurately, only the P2 feature layer was used to segment text regions, and the mask matrix of a text region was resized from 14 × 14 to 14 × 28 to better fit the shape of text regions. Because the conventional Non-Maximum Suppression (NMS) algorithm cannot eliminate duplicate text detection boxes, a filtering module for text regions, named the Text region filtering branch, was designed and added after the detection module to remove redundant detection boxes effectively. The network was trained with a model joint training method consisting of two parts. The first part trained the Feature Pyramid Network (FPN) and the Region Proposal Network (RPN) on large-scale public Latin text data to improve the model's ability to generalize when extracting text from different scenes. The second part trained the candidate box coordinate regression module and the region segmentation module, named the Box branch and the Mask branch respectively, on Vietnamese scene text data with pixel-level annotations, so that the model could segment Vietnamese text regions including their tone marks. Extensive cross-validation and comparison experiments showed that the proposed algorithm achieves better precision and recall than Mask R-CNN under different Intersection over Union (IoU) thresholds.
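
To make the last claim concrete, the following is a minimal sketch of how precision and recall depend on the IoU threshold. It assumes axis-aligned boxes and a simple greedy one-to-one matching; the paper's actual evaluation protocol may differ (for example, by using polygons or masks). The function names `iou` and `precision_recall` and the toy boxes are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Assumption: boxes are axis-aligned, given as (x1, y1, x2, y2).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(dets: List[Box], gts: List[Box], thr: float) -> Tuple[float, float]:
    """Greedy one-to-one matching: a detection counts as a true positive
    if its best unmatched ground-truth box has IoU >= thr."""
    matched = set()
    tp = 0
    for d in dets:
        best_j, best_iou = -1, 0.0
        for j, g in enumerate(gts):
            if j in matched:
                continue
            v = iou(d, g)
            if v > best_iou:
                best_j, best_iou = j, v
        if best_j >= 0 and best_iou >= thr:
            matched.add(best_j)
            tp += 1
    precision = tp / len(dets) if dets else 0.0
    recall = tp / len(gts) if gts else 0.0
    return precision, recall


# Toy data (made up): the first detection misses the strip above the word
# where tone marks would sit, so its box is shorter than the ground truth.
gts = [(0, 0, 100, 40), (0, 60, 80, 90)]
dets = [(0, 10, 100, 40), (0, 60, 80, 90)]

for thr in (0.5, 0.7, 0.8):
    print(thr, precision_recall(dets, gts, thr))
```

In this toy case the detection that cuts off the tone-mark strip has an IoU of 0.75 with its ground truth, so it counts as correct at thresholds 0.5 and 0.7 but not at 0.8. This is why incomplete tone-mark coverage lowers scores mainly at stricter IoU thresholds.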

Key words: Mask R-CNN, Vietnamese scene text detection, tone mark, model joint training, segmentation model, duplicate detection

CLC number: