Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (8): 2394-2400. DOI: 10.11772/j.issn.1001-9081.2021091564
Special topic: Artificial Intelligence
Xianjie ZHANG1,2, Zhiming ZHANG1
Received:
2021-09-03
Revised:
2022-01-05
Accepted:
2022-01-17
Online:
2022-08-09
Published:
2022-08-10
Contact:
Zhiming ZHANG
About author:
ZHANG Xianjie, born in 1991 in Mianyang, Sichuan, M. S. candidate. His research interests include image processing and handwritten text recognition.
Abstract:
Handwritten text recognition technology can transcribe handwritten documents into editable digital documents. However, because of widely varying writing styles, highly variable document structures, and the limited accuracy of character segmentation and recognition, neural-network-based handwritten English text recognition still faces many challenges. To address these problems, a handwritten English text recognition model based on a Convolutional Neural Network (CNN) and a Transformer was proposed. First, the CNN extracts features from the input image; the features are then fed into a Transformer encoder to obtain a prediction for each frame of the feature sequence; finally, a Connectionist Temporal Classification (CTC) decoder produces the final prediction. Extensive experiments on the public IAM (Institut für Angewandte Mathematik) handwritten English word dataset show that the proposed model achieves a Character Error Rate (CER) of 3.60% and a Word Error Rate (WER) of 12.70%, verifying its feasibility.
Xianjie ZHANG, Zhiming ZHANG. Handwritten English text recognition based on convolutional neural network and Transformer[J]. Journal of Computer Applications, 2022, 42(8): 2394-2400.
Layer | Batch size | CER/% | WER/% | Test time per image/ms | Depth | Parameters/10⁶ |
---|---|---|---|---|---|---|
conv1 | 128 | 5.50 | 18.50 | 1.98 | 1 | 94.1 |
conv2_x | 64 | 4.30 | 14.52 | 3.14 | 10 | 95.4 |
conv3_x | 32 | 5.62 | 18.73 | 6.40 | 22 | 101.0 |
conv4_x | 16 | 5.42 | 18.02 | 19.43 | 40 | 132.0 |
conv5_x | 8 | 13.52 | 38.33 | 37.92 | 49 | 197.0 |
Tab. 1 Performance of different interception layers of SE-ResNet-50
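The CER and WER values reported in the tables are edit-distance-based metrics: the Levenshtein distance between prediction and reference, normalized by the reference length (over characters for CER, over word tokens for WER). A minimal sketch of the computation; the function names are illustrative, not taken from the paper:

```python
def levenshtein(ref, hyp):
    """Minimum number of insertions, deletions and substitutions turning ref into hyp."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # single-row dynamic-programming table
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i  # prev holds the diagonal cell dp[i-1][j-1]
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                           # deletion
                        dp[j - 1] + 1,                       # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))   # substitution
            prev = cur
    return dp[n]

def cer(ref, hyp):
    """Character Error Rate: character edit distance / reference length."""
    return levenshtein(ref, hyp) / len(ref)

print(round(cer("handwriting", "handwritten"), 3))  # 0.273 (3 edits / 11 characters)
```

WER is computed the same way after splitting both strings into word tokens, e.g. `levenshtein(ref.split(), hyp.split()) / len(ref.split())`.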
Model | Preprocessing | Language model | Lexicon | Pre-training | CER/% | WER/% |
---|---|---|---|---|---|---|
RNN+CTC[ | — | — | — | — | — | 20.49 |
RNN+CTC[ | — | — | — | Synthetic | 6.34 | 16.19 |
  | — | — | √ | Synthetic | 2.66 | 5.10 |
RNN+CTC[ | √ | — | — | Synthetic | 4.88 | 12.61 |
  | √ | — | √ | Synthetic | 2.17 | 4.07 |
RNN+Attention[ | √ | — | — | — | 8.80 | 23.80 |
  | √ | — | √ | — | 6.20 | 12.70 |
Attention[ | √ | — | — | Synthetic | 5.79 | 15.15 |
  | √ | √ | √ | Synthetic | 4.27 | 8.36 |
Attention[ | — | — | — | CTC | 12.60 | — |
CTC+Attention[ | — | — | — | — | 6.60 | 18.20 |
Proposed model | √ | — | — | — | 3.60 | 12.70 |
Tab. 2 Comparison of evaluation results on IAM handwritten English word dataset
Error type | Proportion/% |
---|---|
One wrong letter inside a word | 41 |
One wrong letter at the beginning or end of a word | 27 |
Wrong letter case | 4 |
Entire word wrong | 1 |
Other | 27 |
Tab. 3 Proportion of prediction error types
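The categories in Tab. 3 can be assigned mechanically by comparing each prediction with its reference word. A rough heuristic sketch (the function `classify_error` and the exact decision rules are assumptions for illustration, not the paper's procedure):

```python
def classify_error(ref, hyp):
    """Assign a predicted word to one of the error categories of Tab. 3 (rough heuristic)."""
    if hyp == ref:
        return "correct"
    if hyp.lower() == ref.lower():
        return "letter case error"
    if len(hyp) == len(ref):
        # Positions where the two equal-length words disagree.
        diffs = [i for i, (r, h) in enumerate(zip(ref, hyp)) if r != h]
        if len(diffs) == 1:
            if diffs[0] in (0, len(ref) - 1):
                return "one wrong letter at word start or end"
            return "one wrong letter inside the word"
    if not any(r == h for r, h in zip(ref, hyp)):
        return "entire word wrong"
    return "other"

print(classify_error("house", "hause"))  # one wrong letter inside the word
print(classify_error("house", "mouse"))  # one wrong letter at word start or end
print(classify_error("House", "house"))  # letter case error
```

Running such a classifier over the model's predictions on the test set and counting each label would reproduce a breakdown of the kind shown in Tab. 3.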
1 | WANG Y T, XIAO W J, LI S. Offline handwritten text recognition using deep learning: a review[J]. Journal of Physics: Conference Series, 2021, 1848: No.012015. 10.1088/1742-6596/1848/1/012015 |
2 | MA Y Y, XIAO B. Offline handwritten text recognition based on CTC-Attention[J]. Laser and Optoelectronics Progress, 2021, 58(12): No.1210007. 10.3788/lop202158.1210007 |
3 | KUMAR M, JINDAL M K, SHARMA R K. Segmentation of isolated and touching characters in offline handwritten Gurmukhi script recognition[J]. International Journal of Information Technology Computer Science, 2014, 6(2): 58-63. 10.5815/ijitcs.2014.02.08 |
4 | WANG Y W, DING X Q, LIU C S. Topic language model adaption for recognition of homologous offline handwritten Chinese text image[J]. IEEE Signal Processing Letters, 2014, 21(5): 550-553. 10.1109/lsp.2014.2308572 |
5 | ESPAÑA-BOQUERA S, CASTRO-BLEDA M J, GORBE-MOYA J, et al. Improving offline handwritten text recognition with hybrid HMM/ANN models[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011, 33(4): 767-779. 10.1109/tpami.2010.141 |
6 | WANG Z R, DU J, WANG W C, et al. A comprehensive study of hybrid neural network hidden Markov model for offline handwritten Chinese text recognition[J]. International Journal on Document Analysis Recognition, 2018, 21(4): 241-251. 10.1007/s10032-018-0307-0 |
7 | WANG Q Q, LU Y. A sequence labeling convolutional network and its application to handwritten string recognition [C]// Proceedings of the 26th International Joint Conference on Artificial Intelligence. California: ijcai.org, 2017: 2950-2956. 10.24963/ijcai.2017/411 |
8 | SUEIRAS J, RUIZ V, SÁNCHEZ Á, et al. Offline continuous handwriting recognition using sequence to sequence neural networks[J]. Neurocomputing, 2018, 289: 119-128. 10.1016/j.neucom.2018.02.008 |
9 | DUTTA K, KRISHNAN P, MATHEW M, et al. Improving CNN-RNN hybrid networks for handwriting recognition [C]// Proceedings of the 16th International Conference on Frontiers in Handwriting Recognition. Piscataway: IEEE, 2018: 80-85. 10.1109/icfhr-2018.2018.00023 |
10 | GEETHA R, THILAGAM T, PADMAVATHY T. Effective offline handwritten text recognition model based on a sequence-to-sequence approach with CNN-RNN networks[J]. Neural Computing Applications, 2021, 33(17): 10923-10934. |
11 | VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2017: 6000-6010. |
12 | DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: transformers for image recognition at scale[EB/OL]. (2021-06-03) [2022-01-04]. |
13 | WANG W H, XIE E Z, LI X, et al. Pyramid vision transformer: a versatile backbone for dense prediction without convolutions [C]// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 548-558. 10.1109/iccv48922.2021.00061 |
14 | WANG W H, XIE E Z, LI X, et al. PVT v2: improved baselines with pyramid vision transformer[J]. Computational Visual Media, 2022, 8(3): 415-424. 10.1007/s41095-022-0274-8 |
15 | RUSSAKOVSKY O, DENG J, SU H, et al. ImageNet large scale visual recognition challenge[J]. International Journal of Computer Vision, 2015, 115(3): 211-252. 10.1007/s11263-015-0816-y |
16 | GIRSHICK R. Fast R-CNN [C]// Proceedings of the 2015 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2015: 1440-1448. 10.1109/iccv.2015.169 |
17 | REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks [C]// Proceedings of the 28th International Conference on Neural Information Processing Systems. Cambridge: MIT Press, 2015: 91-99. |
18 | DAI J F, HE K M, SUN J. Instance-aware semantic segmentation via multi-task network cascades [C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 3150-3158. 10.1109/cvpr.2016.343 |
19 | HE K M, GKIOXARI G, DOLLÁR P, et al. Mask R-CNN [C]// Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 2980-2988. 10.1109/iccv.2017.322 |
20 | KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks [C]// Proceedings of the 25th International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2012: 1097-1105. |
21 | SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[EB/OL]. (2015-04-10) [2022-01-04]. |
22 | SZEGEDY C, LIU W, JIA Y Q, et al. Going deeper with convolutions [C]// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 1-9. 10.1109/cvpr.2015.7298594 |
23 | HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition [C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 770-778. 10.1109/cvpr.2016.90 |
24 | XIE S N, GIRSHICK R, DOLLÁR P, et al. Aggregated residual transformations for deep neural networks [C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 5987-5995. 10.1109/cvpr.2017.634 |
25 | HUANG G, LIU Z, VAN DER MAATEN L, et al. Densely connected convolutional networks [C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 2261-2269. 10.1109/cvpr.2017.243 |
26 | HU J, SHEN L, ALBANIE S, et al. Gather-excite: exploiting feature context in convolutional neural networks [C]// Proceedings of the 32nd International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2018: 9423-9433 |
27 | HU J, SHEN L, SUN G. Squeeze-and-excitation networks [C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 7132-7141. 10.1109/cvpr.2018.00745 |
28 | DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding [C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1(Long and Short Papers). Stroudsburg, PA: Association for Computational Linguistics, 2019: 4171-4186. |
29 | DONG L H, XU S, XU B. Speech-Transformer: a no-recurrence sequence-to-sequence model for speech recognition [C]// Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2018: 5884-5888. 10.1109/icassp.2018.8462506 |
30 | KANG L, RIBA P, RUSIÑOL M, et al. Pay attention to what you read: non-recurrent handwritten text-line recognition[J]. Pattern Recognition, 2022, 129: No.108766. 10.1016/j.patcog.2022.108766 |
31 | MOSTAFA A, MOHAMED O, ASHRAF A, et al. OCFormer: a Transformer-based model for Arabic handwritten text recognition [C]// Proceedings of the 2021 International Mobile, Intelligent, and Ubiquitous Computing Conference. Piscataway: IEEE, 2021: 182-186. 10.1109/miucc52538.2021.9447608 |
32 | LY N T, NGUYEN C T, NAKAGAWA M. Attention augmented convolutional recurrent network for handwritten Japanese text recognition [C]// Proceedings of the 17th International Conference on Frontiers in Handwriting Recognition. Piscataway: IEEE, 2020: 163-168. 10.1109/icfhr2020.2020.00039 |
33 | GRAVES A, FERNÁNDEZ S, GOMEZ F, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks [C]// Proceedings of the 23rd International Conference on Machine Learning. New York: ACM, 2006: 369-376. 10.1145/1143844.1143891 |
34 | GRAVES A, LIWICKI M, FERNÁNDEZ S, et al. A novel connectionist system for unconstrained handwriting recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, 31(5): 855-868. 10.1109/tpami.2008.137 |
35 | CHEN Z, WU Y C, YIN F, et al. Simultaneous script identification and handwriting recognition via multi-task learning of recurrent neural networks [C]// Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition. Piscataway: IEEE, 2017: 525-530. 10.1109/icdar.2017.92 |
36 | ZHAN H J, WANG Q Q, LU Y. Handwritten digit string recognition by combination of residual network and RNN-CTC [C]// Proceedings of the 2017 International Conference on Neural Information Processing, LNCS 10639. Cham: Springer, 2017: 583-591. |
37 | KRISHNAN P, DUTTA K, JAWAHAR C V. Word spotting and recognition using deep embedding [C]// Proceedings of the 13th IAPR International Workshop on Document Analysis Systems. Piscataway: IEEE, 2018: 1-6. 10.1109/das.2018.70 |
38 | BA J L, KIROS J R, HINTON G E. Layer normalization[EB/OL]. (2016-07-21) [2022-01-04]. |
39 | MARTI U V, BUNKE H. The IAM-database: an English sentence database for offline handwriting recognition[J]. International Journal on Document Analysis Recognition, 2002, 5(1): 39-46. 10.1007/s100320200071 |
40 | LUO C J, ZHU Y Z, JIN L W, et al. Learn to augment: joint data augmentation and network optimization for text recognition [C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 13743-13752. 10.1109/cvpr42600.2020.01376 |
41 | MOR N, WOLF L. Confidence prediction for lexicon-free OCR [C]// Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision. Piscataway: IEEE, 2018: 218-225. 10.1109/wacv.2018.00030 |
42 | BLUCHE T, LOURADOUR J, MESSINA R. Scan, attend and read: end-to-end handwritten paragraph recognition with MDLSTM attention [C]// Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition. Piscataway: IEEE, 2017: 1050-1055. 10.1109/icdar.2017.174 |
43 | KANG L, RIBA P, VILLEGAS M, et al. Candidate fusion: integrating language modelling into a sequence-to-sequence handwritten word recognition architecture[J]. Pattern Recognition, 2021, 112: No.107790. 10.1016/j.patcog.2020.107790 |