End-to-end scene character detection and recognition algorithm based on differentiable architecture search

doi:10.11772/j.issn.1001-9081.2022081138

Abstract

Most existing methods for scene character detection and recognition treat detection and recognition as relatively independent stages, which slows processing. In addition, their training and inference procedures are relatively complex, and it is difficult to design a reasonable architecture by hand. To address these problems, a Multi-Branch Automatic Selection Network (MBASNet), composed of several Multi-Branch Automatic Selection Blocks (MBASBs), was proposed based on the differentiable architecture search method. Each MBASB automatically searches for the sub-branch structure with better detection and recognition performance without significantly increasing the computational cost, and multiple MBASBs are combined to form the whole detection and recognition network. MBASNet trains the detection and recognition subnetworks at the same time, which reduces the difficulty of network training and inference in character detection and recognition tasks and improves detection and recognition speed. MBASNet achieved 89.4% precision and 91.4% recall on the ICDAR2013 dataset and 80.5% precision and 86.8% recall on the ICDAR2015 dataset, at a computational speed of 68 Frames Per Second (FPS).

Key words: deep learning, Convolutional Neural Network (CNN), text detection, character recognition, differentiable architecture search

CLC Number:

TP391.41

Jiayi LIU, Dongping CAO, Yong ZHONG. End-to-end scene character detection and recognition algorithm based on differentiable architecture search[J]. Journal of Computer Applications, 2023, 43(S1): 81-87.

Network layer	Input tensor size	Output tensor size
Conv1	N×3×32×32	N×128×32×32
Maxpool1	N×128×32×32	N×128×16×16
MBASB1	N×128×16×16	N×128×16×16
Conv2	N×128×16×16	N×256×16×16
Maxpool2	N×256×16×16	N×256×8×8
MBASB2	N×256×8×8	N×256×8×8
Conv3	N×256×8×8	N×512×8×8
Maxpool3	N×512×8×8	N×512×4×4
MBASB3	N×512×4×4	N×512×4×4
Conv4	N×512×4×4	N×1024×4×4
Maxpool4	N×1024×4×4	N×1024×2×2
MBASB4	N×1024×2×2	N×1024×2×2
GAP	N×1024×2×2	N×1024×1×1
FC	N×1024	N×1
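
The shape bookkeeping in the table above can be checked with a short script. This is only a sketch: the three rules it applies (each Conv layer sets the channel count and keeps the spatial size, each Maxpool halves height and width, each MBASB preserves its input shape) are read off the table rows, not taken from the authors' implementation, and the function name `trace_shapes` is hypothetical. The batch dimension N is omitted, and the final FC row is left out because its output size depends on the task head.

```python
# Trace (channels, height, width) through the layer sequence in the table.
# Assumed rules, inferred from the table rows only:
#   Conv_i    -> changes channels, keeps spatial size
#   Maxpool_i -> halves height and width
#   MBASB_i   -> preserves the shape
#   GAP       -> global average pooling to 1x1

def trace_shapes(in_shape=(3, 32, 32), widths=(128, 256, 512, 1024)):
    c, h, w = in_shape
    rows = []
    for i, out_c in enumerate(widths, start=1):
        rows.append((f"Conv{i}", (c, h, w), (out_c, h, w)))
        c = out_c
        rows.append((f"Maxpool{i}", (c, h, w), (c, h // 2, w // 2)))
        h, w = h // 2, w // 2
        rows.append((f"MBASB{i}", (c, h, w), (c, h, w)))
    rows.append(("GAP", (c, h, w), (c, 1, 1)))
    return rows


for name, src, dst in trace_shapes():
    fmt = lambda s: "N×" + "×".join(str(d) for d in s)
    print(f"{name}\t{fmt(src)}\t{fmt(dst)}")
```

Each printed row should match the corresponding row of the table, from Conv1 through GAP.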

MBASB name	Sub-branch type	Input tensor size	Output tensor size
MBASB1	B₅	N×512×24×24	N×256×24×24
MBASB2	B₂	N×512×48×48	N×256×48×48
MBASB3	B₂	N×384×96×96	N×128×96×96
MBASB4	B₂	N×192×192×192	N×64×192×192
MBASB5	B₅	N×128×384×384	N×128×384×384

MBASB name	Sub-branch type	Input tensor size	Output tensor size
MBASB1	B₂	N×128×16×16	N×128×16×16
MBASB2	B₂	N×256×8×8	N×256×8×8
MBASB3	B₂	N×512×4×4	N×512×4×4
MBASB4	B₁	N×1024×2×2	N×1024×2×2
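
The table above records one selected sub-branch per MBASB. This excerpt does not describe how the selection is made, so the following is only a generic sketch of the continuous relaxation commonly used in differentiable architecture search, not the authors' procedure. During the search, a block outputs a softmax-weighted sum of all candidate branches. After the search, only the branch with the largest architecture weight is kept. The toy scalar branches, the `alphas` values, and all function names are hypothetical.

```python
import math


def softmax(alphas):
    """Numerically stable softmax over a list of architecture parameters."""
    m = max(alphas)
    exps = [math.exp(a - m) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_output(x, branches, alphas):
    """Search phase: softmax-weighted sum of every candidate branch output."""
    weights = softmax(alphas)
    return sum(w * branch(x) for w, branch in zip(weights, branches))


def select_branch(alphas):
    """After the search: index of the branch with the largest weight."""
    return max(range(len(alphas)), key=lambda i: alphas[i])


# Toy scalar functions standing in for convolutional sub-branches (assumption).
branches = [lambda x: x, lambda x: 2.0 * x, lambda x: x * x]
alphas = [0.1, 1.5, -0.3]  # hypothetical learned architecture parameters

print("mixed output:", mixed_output(3.0, branches, alphas))
print("selected branch index:", select_branch(alphas))
```

In a real network the branches would be differentiable modules and `alphas` would be trained by gradient descent alongside the network weights. The final discretization step is what yields a single sub-branch type per block.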