Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (8): 2394-2400. DOI: 10.11772/j.issn.1001-9081.2021091564
Special Issue: Artificial Intelligence
Handwritten English text recognition based on convolutional neural network and Transformer
Xianjie ZHANG1,2, Zhiming ZHANG1
Received: 2021-09-03
Revised: 2022-01-05
Accepted: 2022-01-17
Online: 2022-08-09
Published: 2022-08-10
Contact: Zhiming ZHANG
About author: ZHANG Xianjie, born in 1991, M. S. candidate. His research interests include image processing and handwriting recognition.
CLC Number:
Xianjie ZHANG, Zhiming ZHANG. Handwritten English text recognition based on convolutional neural network and Transformer[J]. Journal of Computer Applications, 2022, 42(8): 2394-2400.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2021091564
| Interception layer | Batch size | CER/% | WER/% | Test time per image/ms | Depth | Parameters/10⁶ |
|---|---|---|---|---|---|---|
| conv1 | 128 | 5.50 | 18.50 | 1.98 | 1 | 94.1 |
| conv2_x | 64 | 4.30 | 14.52 | 3.14 | 10 | 95.4 |
| conv3_x | 32 | 5.62 | 18.73 | 6.40 | 22 | 101.0 |
| conv4_x | 16 | 5.42 | 18.02 | 19.43 | 40 | 132.0 |
| conv5_x | 8 | 13.52 | 38.33 | 37.92 | 49 | 197.0 |

Tab. 1 Performance of different interception layers of SE-ResNet-50
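Tab. 1 indicates that truncating the SE-ResNet-50 backbone right after conv2_x gives the best CER/WER at a moderate depth and parameter count. The snippet below is a minimal sketch of this kind of layer interception in PyTorch under two assumptions: torchvision's plain ResNet-50 stands in for SE-ResNet-50 (it shares the conv1/conv2_x…conv5_x layout but lacks SE blocks), and all variable names are illustrative rather than taken from the paper.

```python
# Sketch: cut a ResNet-style backbone after conv2_x and use it as a feature extractor.
# Assumes a recent torchvision (weights=None API); plain ResNet-50 stands in for SE-ResNet-50.
import torch
from torch import nn
from torchvision.models import resnet50

backbone = resnet50(weights=None)
# Children order: conv1, bn1, relu, maxpool, layer1 (conv2_x), layer2 (conv3_x), ...
# Keeping the first five children truncates the network right after conv2_x.
conv2_x_extractor = nn.Sequential(*list(backbone.children())[:5])

with torch.no_grad():
    # A grayscale text-line image would be replicated to 3 channels before this step.
    x = torch.randn(1, 3, 64, 256)      # (batch, channels, height, width)
    feats = conv2_x_extractor(x)
    print(feats.shape)                  # torch.Size([1, 256, 16, 64]): stride 4, 256 channels
```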
| Model | Preprocessing | Language model | Lexicon | Pretraining | CER/% | WER/% |
|---|---|---|---|---|---|---|
| RNN+CTC[ | — | — | — | — | — | 20.49 |
| RNN+CTC[ | — | — | — | Synthetic | 6.34 | 16.19 |
|  |  |  | √ | Synthetic | 2.66 | 5.10 |
| RNN+CTC[ | √ | — | — | Synthetic | 4.88 | 12.61 |
|  |  |  | √ | Synthetic | 2.17 | 4.07 |
| RNN+Attention[ | √ | — | — | — | 8.80 | 23.80 |
|  |  |  | √ | — | 6.20 | 12.70 |
| Attention[ | √ | — | — | Synthetic | 5.79 | 15.15 |
|  |  | √ | √ | Synthetic | 4.27 | 8.36 |
| Attention[ | — | — | — | CTC | 12.60 | — |
| CTC+Attention[ | — | — | — | — | 6.60 | 18.20 |
| Proposed model | √ | — | — | — | 3.60 | 12.70 |

(A blank cell repeats the entry of the row above, reflecting the merged cells of the original table.)

Tab. 2 Comparison of evaluation results on IAM handwritten English word dataset
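The CER and WER columns in Tab. 2 are the standard character and word error rates: the Levenshtein edit distance between a prediction and its ground truth, normalised by the reference length in characters or in words. The sketch below shows one common way to compute them; the helper names are illustrative and not taken from the paper.

```python
# Sketch of CER/WER computation via Levenshtein edit distance (illustrative helpers).
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of tokens)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(list(ref), list(hyp)) / max(len(ref), 1)

def wer(ref, hyp):
    """Word error rate: word-level edit distance / number of reference words."""
    return edit_distance(ref.split(), hyp.split()) / max(len(ref.split()), 1)

print(cer("handwriting", "handwritting"))    # 1 insertion over 11 chars ≈ 0.091
print(wer("the quick fox", "the quik fox"))  # 1 wrong word over 3 ≈ 0.333
```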
| Error type | Proportion/% |
|---|---|
| One wrong letter inside a word | 41 |
| One wrong letter at the beginning or end of a word | 27 |
| Letter case error | 4 |
| Entire word wrong | 1 |
| Other | 27 |

Tab. 3 Proportion of types of prediction errors
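A breakdown like Tab. 3 can be produced by comparing each mispredicted word with its ground truth and assigning the mismatch to a category. The rules below are a hypothetical reading of those categories (one wrong letter inside the word, one wrong letter at the first or last position, case-only error, entire word wrong), not the authors' exact criteria.

```python
# Sketch: bucket a wrong prediction into Tab. 3-style categories (hypothetical rules).
def classify_error(ref: str, hyp: str) -> str:
    if ref == hyp:
        return "correct"
    if ref.lower() == hyp.lower():
        return "case error"
    if len(ref) == len(hyp):
        # Positions where the two words disagree.
        diffs = [i for i, (a, b) in enumerate(zip(ref, hyp)) if a != b]
        if len(diffs) == 1:
            return ("letter error at word boundary" if diffs[0] in (0, len(ref) - 1)
                    else "letter error inside word")
    # No shared characters at all -> treat the whole word as wrong.
    if not set(ref.lower()) & set(hyp.lower()):
        return "whole word error"
    return "other"

print(classify_error("there", "thare"))   # letter error inside word
print(classify_error("Moved", "moved"))   # case error
```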
1 | WANG Y T, XIAO W J, LI S. Offline handwritten text recognition using deep learning: a review[J]. Journal of Physics: Conference Series, 2021, 1848: No.012015. 10.1088/1742-6596/1848/1/012015 |
2 | MA Y Y, XIAO B. Offline handwritten text recognition based on CTC-Attention[J]. Laser and Optoelectronics Progress, 2021, 58(12): No.1210007. 10.3788/lop202158.1210007 |
3 | KUMAR M, JINDAL M K, SHARMA R K. Segmentation of isolated and touching characters in offline handwritten Gurmukhi script recognition[J]. International Journal of Information Technology and Computer Science, 2014, 6(2): 58-63. 10.5815/ijitcs.2014.02.08 |
4 | WANG Y W, DING X Q, LIU C S. Topic language model adaption for recognition of homologous offline handwritten Chinese text image[J]. IEEE Signal Processing Letters, 2014, 21(5): 550-553. 10.1109/lsp.2014.2308572 |
5 | ESPAÑA-BOQUERA S, CASTRO-BLEDA M J, GORBE-MOYA J, et al. Improving offline handwritten text recognition with hybrid HMM/ANN models[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011, 33(4): 767-779. 10.1109/tpami.2010.141 |
6 | WANG Z R, DU J, WANG W C, et al. A comprehensive study of hybrid neural network hidden Markov model for offline handwritten Chinese text recognition[J]. International Journal on Document Analysis and Recognition, 2018, 21(4): 241-251. 10.1007/s10032-018-0307-0 |
7 | WANG Q Q, LU Y. A sequence labeling convolutional network and its application to handwritten string recognition [C]// Proceedings of the 26th International Joint Conference on Artificial Intelligence. California: ijcai.org, 2017: 2950-2956. 10.24963/ijcai.2017/411 |
8 | SUEIRAS J, RUIZ V, SÁNCHEZ Á, et al. Offline continuous handwriting recognition using sequence to sequence neural networks[J]. Neurocomputing, 2018, 289: 119-128. 10.1016/j.neucom.2018.02.008 |
9 | DUTTA K, KRISHNAN P, MATHEW M, et al. Improving CNN-RNN hybrid networks for handwriting recognition [C]// Proceedings of the 16th International Conference on Frontiers in Handwriting Recognition. Piscataway: IEEE, 2018: 80-85. 10.1109/icfhr-2018.2018.00023 |
10 | GEETHA R, THILAGAM T, PADMAVATHY T. Effective offline handwritten text recognition model based on a sequence-to-sequence approach with CNN-RNN networks[J]. Neural Computing and Applications, 2021, 33(17): 10923-10934. |
11 | VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2017: 6000-6010. |
12 | DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: transformers for image recognition at scale[EB/OL]. (2021-06-03) [2022-01-04]. |
13 | WANG W H, XIE E Z, LI X, et al. Pyramid vision transformer: a versatile backbone for dense prediction without convolutions [C]// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 548-558. 10.1109/iccv48922.2021.00061 |
14 | WANG W H, XIE E Z, LI X, et al. PVT v2: improved baselines with pyramid vision transformer[J]. Computational Visual Media, 2022, 8(3): 415-424. 10.1007/s41095-022-0274-8 |
15 | RUSSAKOVSKY O, DENG J, SU H, et al. ImageNet large scale visual recognition challenge[J]. International Journal of Computer Vision, 2015, 115(3): 211-252. 10.1007/s11263-015-0816-y |
16 | GIRSHICK R. Fast R-CNN [C]// Proceedings of the 2015 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2015: 1440-1448. 10.1109/iccv.2015.169 |
17 | REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks [C]// Proceedings of the 28th International Conference on Neural Information Processing Systems. Cambridge: MIT Press, 2015: 91-99. |
18 | DAI J F, HE K M, SUN J. Instance-aware semantic segmentation via multi-task network cascades [C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 3150-3158. 10.1109/cvpr.2016.343 |
19 | HE K M, GKIOXARI G, DOLLÁR P, et al. Mask R-CNN [C]// Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 2980-2988. 10.1109/iccv.2017.322 |
20 | KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks [C]// Proceedings of the 25th International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2012: 1097-1105. |
21 | SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[EB/OL]. (2015-04-10) [2022-01-04]. |
22 | SZEGEDY C, LIU W, JIA Y Q, et al. Going deeper with convolutions [C]// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 1-9. 10.1109/cvpr.2015.7298594 |
23 | HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition [C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 770-778. 10.1109/cvpr.2016.90 |
24 | XIE S N, GIRSHICK R, DOLLÁR P, et al. Aggregated residual transformations for deep neural networks [C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 5987-5995. 10.1109/cvpr.2017.634 |
25 | HUANG G, LIU Z, VAN DER MAATEN L, et al. Densely connected convolutional networks [C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 2261-2269. 10.1109/cvpr.2017.243 |
26 | HU J, SHEN L, ALBANIE S, et al. Gather-excite: exploiting feature context in convolutional neural networks [C]// Proceedings of the 32nd International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2018: 9423-9433 |
27 | HU J, SHEN L, SUN G. Squeeze-and-excitation networks [C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 7132-7141. 10.1109/cvpr.2018.00745 |
28 | DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding [C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1(Long and Short Papers). Stroudsburg, PA: Association for Computational Linguistics, 2019: 4171-4186. |
29 | DONG L H, XU S, XU B. Speech-Transformer: a no-recurrence sequence-to-sequence model for speech recognition [C]// Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2018: 5884-5888. 10.1109/icassp.2018.8462506 |
30 | KANG L, RIBA P, RUSIÑOL M, et al. Pay attention to what you read: non-recurrent handwritten text-line recognition[J]. Pattern Recognition, 2022, 129: No.108766. 10.1016/j.patcog.2022.108766 |
31 | MOSTAFA A, MOHAMED O, ASHRAF A, et al. OCFormer: a Transformer-based model for Arabic handwritten text recognition [C]// Proceedings of the 2021 International Mobile, Intelligent, and Ubiquitous Computing Conference. Piscataway: IEEE, 2021: 182-186. 10.1109/miucc52538.2021.9447608 |
32 | LY N T, NGUYEN C T, NAKAGAWA M. Attention augmented convolutional recurrent network for handwritten Japanese text recognition [C]// Proceedings of the 17th International Conference on Frontiers in Handwriting Recognition. Piscataway: IEEE, 2020: 163-168. 10.1109/icfhr2020.2020.00039 |
33 | GRAVES A, FERNÁNDEZ S, GOMEZ F, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks [C]// Proceedings of the 23rd International Conference on Machine Learning. New York: ACM, 2006: 369-376. 10.1145/1143844.1143891 |
34 | GRAVES A, LIWICKI M, FERNÁNDEZ S, et al. A novel connectionist system for unconstrained handwriting recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, 31(5): 855-868. 10.1109/tpami.2008.137 |
35 | CHEN Z, WU Y C, YIN F, et al. Simultaneous script identification and handwriting recognition via multi-task learning of recurrent neural networks [C]// Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition. Piscataway: IEEE, 2017: 525-530. 10.1109/icdar.2017.92 |
36 | ZHAN H J, WANG Q Q, LU Y. Handwritten digit string recognition by combination of residual network and RNN-CTC [C]// Proceedings of the 2017 International Conference on Neural Information Processing, LNCS 10639. Cham: Springer, 2017: 583-591. |
37 | KRISHNAN P, DUTTA K, JAWAHAR C V. Word spotting and recognition using deep embedding [C]// Proceedings of the 13th IAPR International Workshop on Document Analysis Systems. Piscataway: IEEE, 2018: 1-6. 10.1109/das.2018.70 |
38 | BA J L, KIROS J R, HINTON G E. Layer normalization[EB/OL]. (2016-07-21) [2022-01-04]. |
39 | MARTI U V, BUNKE H. The IAM-database: an English sentence database for offline handwriting recognition[J]. International Journal on Document Analysis and Recognition, 2002, 5(1): 39-46. 10.1007/s100320200071 |
40 | LUO C J, ZHU Y Z, JIN L W, et al. Learn to augment: joint data augmentation and network optimization for text recognition [C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 13743-13752. 10.1109/cvpr42600.2020.01376 |
41 | MOR N, WOLF L. Confidence prediction for lexicon-free OCR [C]// Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision. Piscataway: IEEE, 2018: 218-225. 10.1109/wacv.2018.00030 |
42 | BLUCHE T, LOURADOUR J, MESSINA R. Scan, attend and read: end-to-end handwritten paragraph recognition with MDLSTM attention [C]// Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition. Piscataway: IEEE, 2017: 1050-1055. 10.1109/icdar.2017.174 |
43 | KANG L, RIBA P, VILLEGAS M, et al. Candidate fusion: integrating language modelling into a sequence-to-sequence handwritten word recognition architecture[J]. Pattern Recognition, 2021, 112: No.107790. 10.1016/j.patcog.2020.107790 |