CCML2021+352: Vietnamese Scene Text Detection Based on a Modified Mask R-CNN

FENG Yate (俸亚特)¹, WEN Yimin (文益民)²

1. School of Computer Science and Information Security, Guilin University of Electronic Technology
2. School of Computer Science and Engineering, Guilin University of Electronic Technology, Guilin, Guangxi 541004

Received: 2021-05-19 Revised: 2021-05-20 Online: 2021-05-20
Corresponding author: WEN Yimin

Abstract

Abstract: To address the scarcity of training data for Vietnamese scene text detection and the incomplete detection of Vietnamese tone markers, this paper proposes a detection algorithm for Vietnamese scene text based on a modified version of the instance segmentation network Mask R-CNN. To segment Vietnamese scene text together with its tone markers accurately, the algorithm uses only the high-resolution P2 feature layer to segment text regions, and changes the size of the text-region mask matrix from 14 × 14 to 14 × 28 to better fit the aspect ratio of text regions. Because the conventional non-maximum suppression algorithm cannot always eliminate duplicate text detection boxes, a filter module designed for text regions is added after the detection module to remove redundant boxes. The network is trained with a joint model training method in two parts. The first part trains the Feature Pyramid Network and the Region Proposal Network on large-scale public Latin text data, to improve the model's ability to generalize when extracting text from different scenes. The second part trains the candidate box coordinate regression module and the region segmentation module on pixel-level Vietnamese scene text data, so that the model can segment Vietnamese text regions including their tone markers. Extensive cross-validation and comparative experiments show that the proposed algorithm achieves better precision and recall than Mask R-CNN under different Intersection over Union (IoU) thresholds.
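The evaluation criterion named in the abstract, precision and recall at a given IoU threshold, can be illustrated with a minimal sketch. The abstract does not specify the box format or the matching protocol, so the axis-aligned `(x1, y1, x2, y2)` boxes, the greedy one-to-one matching, and the function names `iou` and `precision_recall` are assumptions made for illustration and are not the authors' evaluation code.

```python
from typing import List, Tuple

# Assumed box format: axis-aligned (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(preds: List[Box], gts: List[Box], thr: float) -> Tuple[float, float]:
    """Precision and recall at IoU threshold `thr`.

    Assumed protocol: predictions are visited in order, and each one claims
    the not-yet-matched ground-truth box with the highest IoU, provided that
    IoU is at least `thr`.
    """
    matched = set()  # indices of ground-truth boxes already claimed
    true_positives = 0
    for p in preds:
        best_j, best_iou = -1, thr
        for j, g in enumerate(gts):
            if j in matched:
                continue
            v = iou(p, g)
            if v >= best_iou:
                best_j, best_iou = j, v
        if best_j >= 0:
            matched.add(best_j)
            true_positives += 1
    precision = true_positives / len(preds) if preds else 0.0
    recall = true_positives / len(gts) if gts else 0.0
    return precision, recall
```

Under this protocol, raising the threshold can only keep or reduce the number of matched boxes, which is why results are usually reported at several IoU thresholds rather than one.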

Keywords: Vietnamese scene text detection, tone marker, model joint training, segmentation module, duplicate detection

CLC number: TP391.1

FENG Yate, WEN Yimin. CCML2021+352: Vietnamese Scene Text Detection Based on a Modified Mask R-CNN[J]. Journal of Computer Applications (计算机应用).