End-to-end scene character detection and recognition algorithm based on differentiable architecture search

Journal of Computer Applications, 2023, Vol. 43, Issue (S1): 81-87. DOI: 10.11772/j.issn.1001-9081.2022081138

Jiayi LIU¹,², Dongping CAO¹,², Yong ZHONG¹,²

1. Chengdu Institute of Computer Applications, Chinese Academy of Sciences, Chengdu Sichuan 610041, China
2. University of Chinese Academy of Sciences, Beijing 100049, China

Received: 2022-08-22 Revised: 2022-10-26 Accepted: 2022-11-14 Online: 2023-07-04 Published: 2023-06-30
Corresponding author: Yong ZHONG
About the authors: Jiayi LIU (born 1996), male, from Neijiang, Sichuan, master's student; main research interests: artificial intelligence, computer vision
Dongping CAO (born 1992), male, from Chengdu, Sichuan, Ph.D.; main research interests: image processing, pattern recognition
Yong ZHONG (born 1966), male, from Yuechi, Sichuan, research fellow, Ph.D., CCF member; main research interests: big data, artificial intelligence, software process technology and methods. zhongyong@casit.com.cn
Supported by:
Sichuan Science and Technology Achievement Transformation Program (2020ZHZY0002)

Abstract

Most existing methods for scene character detection and recognition treat detection and recognition as relatively independent processes, which makes them slow; in addition, their training and inference procedures are relatively complex, and designing a reasonable architecture by hand is difficult. To address these problems, a Multi-Branch Automatic Selection Network (MBASNet) is proposed based on the differentiable architecture search method. The network consists of several Multi-Branch Automatic Selection Blocks (MBASBs). Each MBASB automatically searches for a sub-branch structure with better detection and recognition performance without significantly increasing the computational cost, and multiple MBASBs are combined to form the whole detection and recognition network. MBASNet can train the detection subnetwork and the recognition subnetwork at the same time, which reduces the difficulty of training and inference in character detection and recognition tasks and improves detection and recognition speed. MBASNet achieved 89.4% precision and 91.4% recall on the ICDAR2013 dataset and 80.5% precision and 86.8% recall on the ICDAR15 dataset, with a computational speed of 68 Frames Per Second (FPS).

Key words: deep learning, Convolutional Neural Network (CNN), text detection, character recognition, differentiable architecture search
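
The abstract does not spell out how an MBASB chooses among its sub-branches. As a rough illustration of the general idea behind differentiable architecture search, the sketch below weights the outputs of several candidate branches by a softmax over architecture parameters during search, then keeps only the branch with the largest parameter. The three branch functions, the 1-D feature vector, and the `alphas` values are hypothetical stand-ins, not the paper's sub-branches, and in a real search the parameters would be learned by gradient descent rather than set by hand.

```python
import math


def softmax(alphas):
    """Numerically stable softmax over a list of architecture parameters."""
    peak = max(alphas)
    exps = [math.exp(a - peak) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidate branches acting on a 1-D feature vector.
# They stand in for the block's real sub-branches, which the abstract does not specify.
def identity(x):
    return list(x)


def scale_half(x):
    return [0.5 * v for v in x]


def local_average(x):
    """3-tap moving average with edge clamping (a stand-in for a small convolution)."""
    n = len(x)
    return [(x[max(i - 1, 0)] + x[i] + x[min(i + 1, n - 1)]) / 3.0 for i in range(n)]


BRANCHES = [identity, scale_half, local_average]


def mixed_forward(x, alphas):
    """Search phase: the output is the softmax-weighted sum of all branch outputs."""
    weights = softmax(alphas)
    outputs = [branch(x) for branch in BRANCHES]
    return [sum(w * out[i] for w, out in zip(weights, outputs)) for i in range(len(x))]


def select_branch(alphas):
    """After search: keep only the branch with the largest architecture parameter."""
    return max(range(len(alphas)), key=lambda i: alphas[i])


x = [1.0, 2.0, 4.0]
alphas = [0.1, -0.3, 1.2]          # assumed values; normally learned during search
mixed = mixed_forward(x, alphas)   # continuous relaxation used while searching
chosen = select_branch(alphas)     # index of the branch kept afterwards
final = BRANCHES[chosen](x)        # the discretized block runs a single branch
```

Because only one branch survives after selection, the final block costs about as much as a single branch, which fits the abstract's claim that the search does not significantly increase computation.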

CLC Number: TP391.41

Cite this article: Jiayi LIU, Dongping CAO, Yong ZHONG. End-to-end scene character detection and recognition algorithm based on differentiable architecture search[J]. Journal of Computer Applications, 2023, 43(S1): 81-87.


Layer	Input tensor size	Output tensor size
Conv1	N×3×32×32	N×128×32×32
Maxpool1	N×128×32×32	N×128×16×16
MBASB1	N×128×16×16	N×128×16×16
Conv2	N×128×16×16	N×256×16×16
Maxpool2	N×256×16×16	N×256×8×8
MBASB2	N×256×8×8	N×256×8×8
Conv3	N×256×8×8	N×512×8×8
Maxpool3	N×512×8×8	N×512×4×4
MBASB3	N×512×4×4	N×512×4×4
Conv4	N×512×4×4	N×1024×4×4
Maxpool4	N×1024×4×4	N×1024×2×2
MBASB4	N×1024×2×2	N×1024×2×2
GAP	N×1024×2×2	N×1024×1×1
FC	N×1024	N×1
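
The sizes in this table follow a simple pattern that can be checked mechanically. The sketch below traces a (C, H, W) shape through the layer stack under assumptions read off the table rather than stated in the text: each convolution changes only the channel count, each max-pooling halves the height and width (for example a 2×2 window with stride 2), each MBASB keeps its input size, and global average pooling (GAP) reduces the spatial size to 1×1. The helper names `apply_layer` and `trace` are illustrative.

```python
def apply_layer(shape, kind, arg=None):
    """Map a (C, H, W) shape through one layer; the batch dimension N is omitted."""
    channels, height, width = shape
    if kind == "conv":    # assumed to preserve H and W, as the table shows
        return (arg, height, width)
    if kind == "pool":    # assumed to halve H and W
        return (channels, height // 2, width // 2)
    if kind == "mbasb":   # input and output sizes are identical in the table
        return shape
    if kind == "gap":     # global average pooling collapses H and W to 1
        return (channels, 1, 1)
    if kind == "fc":      # fully connected layer on the flattened C×1×1 features
        if (height, width) != (1, 1):
            raise ValueError("FC expects a globally pooled input")
        return (arg,)
    raise ValueError(f"unknown layer kind: {kind}")


def trace(input_shape, layers):
    """Return a list of (name, input_shape, output_shape) rows."""
    shape, rows = input_shape, []
    for name, kind, arg in layers:
        out = apply_layer(shape, kind, arg)
        rows.append((name, shape, out))
        shape = out
    return rows


# Build the stack from the table: four Conv/Maxpool/MBASB stages, then GAP and FC.
layers = []
for i, channels in enumerate([128, 256, 512, 1024], start=1):
    layers += [
        (f"Conv{i}", "conv", channels),
        (f"Maxpool{i}", "pool", None),
        (f"MBASB{i}", "mbasb", None),
    ]
layers += [("GAP", "gap", None), ("FC", "fc", 1)]

rows = trace((3, 32, 32), layers)
for name, in_shape, out_shape in rows:
    print(name, in_shape, out_shape)
```

The traced shapes match the table row by row; the only presentational difference is that the table lists the FC input as the flattened N×1024 vector, while the trace carries it as (1024, 1, 1).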

MBASB name	Sub-branch type	Input tensor size	Output tensor size
MBASB1	B₅	N×512×24×24	N×256×24×24
MBASB2	B₂	N×512×48×48	N×256×48×48
MBASB3	B₂	N×384×96×96	N×128×96×96
MBASB4	B₂	N×192×192×192	N×64×192×192
MBASB5	B₅	N×128×384×384	N×128×384×384

MBASB name	Sub-branch type	Input tensor size	Output tensor size
MBASB1	B₂	N×128×16×16	N×128×16×16
MBASB2	B₂	N×256×8×8	N×256×8×8
MBASB3	B₂	N×512×4×4	N×512×4×4
MBASB4	B₁	N×1024×2×2	N×1024×2×2

Method	ICDAR13 (DetEval) Recall/%	ICDAR13 (DetEval) Precision/%	ICDAR15 Recall/%	ICDAR15 Precision/%	FPS
SegLink	83.0	87.7	76.8	73.1	20.6
SSTD	86.0	89.0	73.0	80.0	7.7
Mask TextSpotter	88.1	94.1	81.2	85.8	4.8
R2CNN	82.6	93.6	79.7	85.6	0.4
PixelLink	87.5	88.6	82.0	85.5	3.0
Proposed method	89.4	91.4	80.5	86.8	68.4
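
The table reports recall and precision separately. To compare methods with a single number, one common choice is the F-measure, the harmonic mean of the two. The sketch below derives it from the table values and also computes the speed ratio against the fastest baseline. These derived figures are computed here for illustration and are not reported in this excerpt; the names `RESULTS` and `f_measure` are illustrative.

```python
# Values copied from the comparison table:
# (ICDAR13 recall, ICDAR13 precision, ICDAR15 recall, ICDAR15 precision, FPS)
RESULTS = {
    "SegLink": (83.0, 87.7, 76.8, 73.1, 20.6),
    "SSTD": (86.0, 89.0, 73.0, 80.0, 7.7),
    "Mask TextSpotter": (88.1, 94.1, 81.2, 85.8, 4.8),
    "R2CNN": (82.6, 93.6, 79.7, 85.6, 0.4),
    "PixelLink": (87.5, 88.6, 82.0, 85.5, 3.0),
    "Proposed method": (89.4, 91.4, 80.5, 86.8, 68.4),
}


def f_measure(recall, precision):
    """Harmonic mean of recall and precision (both in percent)."""
    return 2 * recall * precision / (recall + precision)


# Derived F-measures per method for both benchmarks.
summary = {
    name: (f_measure(r13, p13), f_measure(r15, p15), fps)
    for name, (r13, p13, r15, p15, fps) in RESULTS.items()
}

# Speed of the proposed method relative to the fastest baseline in the table.
baselines = {name: row for name, row in RESULTS.items() if name != "Proposed method"}
fastest_baseline = max(baselines, key=lambda name: baselines[name][4])
speedup = RESULTS["Proposed method"][4] / baselines[fastest_baseline][4]

for name, (f13, f15, fps) in summary.items():
    print(f"{name}: F(ICDAR13)={f13:.1f}, F(ICDAR15)={f15:.1f}, FPS={fps}")
print(f"Speed relative to {fastest_baseline}: {speedup:.2f}x")
```

Because the harmonic mean is symmetric, the derived F-measure does not depend on which of the two columns is read as recall and which as precision.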
