Single direction projected Transformer method for aliasing text detection

doi:10.11772/j.issn.1001-9081.2021101749

Abstract

To address the performance degradation of segmentation-based text detection methods in aliasing text scenes, a Single Direction Projected Transformer (SDPT) was proposed for aliasing text detection. First, multi-scale features were extracted and fused using a deep Residual Network (ResNet) and a Feature Pyramid Network (FPN). Then, the feature map was projected into a vector sequence by horizontal projection and fed into a Transformer module, which modeled the relationships between lines of text. Finally, the network was jointly optimized with multiple objectives. Extensive experiments were conducted on the synthetic dataset BDD-SynText and the real dataset RealText. The results show that SDPT achieves the best results for text detection at high aliasing levels: under the same backbone network (ResNet50), it improves F1-Score (IoU75) by at least 21.36 percentage points on BDD-SynText and by at least 18.11 percentage points on RealText compared with state-of-the-art text detection algorithms such as Progressive Scale Expansion Network (PSENet), verifying the important role of the proposed method in improving aliasing text detection performance.

Key words: computer vision, deep learning, scene text detection, aliasing text, projection, Transformer algorithm
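
The projection step described in the abstract can be illustrated with a minimal sketch. It assumes mean pooling along the width axis as the projection operator, and it uses a single self-attention layer with identity query/key/value projections as a stand-in for the Transformer module. Neither detail is given in the abstract, and the function names are hypothetical.

```python
import math


def horizontal_projection(feature_map):
    """Collapse the width axis of a C x H x W feature map by mean pooling.

    Returns H vectors of length C, i.e. one token per image row.
    Mean pooling is an assumption; the abstract does not name the operator.
    """
    channels = len(feature_map)
    height = len(feature_map[0])
    return [
        [sum(feature_map[c][y]) / len(feature_map[c][y]) for c in range(channels)]
        for y in range(height)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity projections.

    A stand-in for the Transformer module: every row token attends to every
    other row token, so information is mixed across rows of the image.
    """
    dim = len(tokens[0])
    output = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        output.append(
            [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
        )
    return output


# Toy feature map: 2 channels, 3 rows, 4 columns.
feature_map = [
    [[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]],
    [[0.0, 0.0, 0.0, 0.0], [2.0, 2.0, 2.0, 2.0], [0.0, 0.0, 0.0, 0.0]],
]
row_tokens = horizontal_projection(feature_map)  # 3 tokens, each of length 2
mixed_tokens = self_attention(row_tokens)        # same shape, rows now mixed
```

Collapsing the width axis leaves one token per image row, so the attention weights in this sketch relate rows to one another, which is the kind of line-to-line relationship the abstract refers to.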

CLC Number:

TP391.4

Zhida FENG, Li CHEN. Single direction projected Transformer method for aliasing text detection[J]. Journal of Computer Applications, 2022, 42(12): 3686-3691.

Figures and tables

Fig.1 Comparison of the proposed method and classical segmentation-based methods

Fig.2 Three situations of text aliasing

Fig.3 Example of aliasing level calculation

Fig.4 Overview flow of the proposed method

Fig.5 Multi-scale feature fusion

Fig.6 Transformer Encoder

Fig.7 Image examples of BDD-SynText

Tab.1 Comparison results of IoU50 benchmark on BDD-SynText dataset

Method	Backbone	P ↑ /%	R ↑ /%	F ↑ /%
PSENet	ResNet18	88.04	73.50	79.10
PAN	ResNet18	76.58	68.31	71.11
DB	ResNet18	93.94	88.01	90.53
Proposed method	ResNet18	98.54	98.48	98.50
PSENet	ResNet50	89.89	75.23	80.91
PAN	ResNet50	74.35	67.07	69.55
DB	ResNet50	95.24	90.22	92.39
Proposed method	ResNet50	98.43	98.35	98.39

Tab.2 Comparison results of IoU75 benchmark on BDD-SynText dataset

Method	Backbone	P ↑ /%	R ↑ /%	F ↑ /%
PSENet	ResNet18	74.97	64.15	68.43
PAN	ResNet18	78.21	75.58	76.50
DB	ResNet18	65.63	61.62	63.33
Proposed method	ResNet18	92.27	92.19	92.23
PSENet	ResNet50	78.39	67.23	71.68
PAN	ResNet50	64.35	59.47	61.18
DB	ResNet50	63.44	60.25	61.64
Proposed method	ResNet50	93.07	93.00	93.04
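
The IoU50 and IoU75 benchmarks differ only in how tightly a predicted box must overlap a ground-truth box to count as a correct detection. The sketch below shows that difference for a single pair of boxes. It assumes axis-aligned boxes and a simple thresholded match. The evaluation protocol actually used for these tables is not described in this listing, and the helper names are hypothetical.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Boxes are (x_min, y_min, x_max, y_max). Axis-aligned boxes are an
    assumption made for illustration only.
    """
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, gt, threshold):
    """A prediction counts as correct when its IoU reaches the threshold."""
    return box_iou(pred, gt) >= threshold


ground_truth = (0.0, 0.0, 100.0, 20.0)  # a text line 20 pixels tall
prediction = (0.0, 0.0, 100.0, 13.0)    # covers only the top 13 pixels

iou = box_iou(prediction, ground_truth)              # 1300 / 2000 = 0.65
match_iou50 = is_match(prediction, ground_truth, 0.50)  # True
match_iou75 = is_match(prediction, ground_truth, 0.75)  # False
```

A prediction that clips part of a text line can pass at IoU50 and fail at IoU75, which is one way the same detector can score lower on the stricter benchmark.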

Tab.3 Comparison results of IoU50 benchmark on RealText dataset

Method	Backbone	P ↑ /%	R ↑ /%	F ↑ /%
PSENet	ResNet18	98.60	90.08	94.15
PAN	ResNet18	54.61	51.04	52.77
DB	ResNet18	41.49	60.23	49.14
Proposed method	ResNet18	97.25	97.33	97.29
PSENet	ResNet50	99.50	91.50	95.33
PAN	ResNet50	79.42	59.12	67.78
DB	ResNet50	45.92	72.80	56.32
Proposed method	ResNet50	97.42	97.51	97.47

Tab.4 Comparison results of IoU75 benchmark on RealText dataset

Method	Backbone	P ↑ /%	R ↑ /%	F ↑ /%
PSENet	ResNet18	84.34	77.05	80.53
PAN	ResNet18	42.34	39.57	40.91
DB	ResNet18	24.98	36.26	29.58
Proposed method	ResNet18	97.25	97.33	97.29
PSENet	ResNet50	82.83	76.17	79.36
PAN	ResNet50	64.20	47.80	54.80
DB	ResNet50	28.93	45.86	35.48
Proposed method	ResNet50	97.42	97.51	97.47
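
The "at least" gains quoted in the abstract can be checked against the ResNet50 rows of Tab.2 and Tab.4. The short sketch below subtracts the strongest baseline F-score from the proposed method's F-score on each dataset. The dictionary layout and the `min_gain` helper are illustrative choices, and the numbers are copied from the two tables.

```python
# F-scores (IoU75, ResNet50 backbone) copied from Tab.2 and Tab.4.
f1_iou75 = {
    "BDD-SynText": {"PSENet": 71.68, "PAN": 61.18, "DB": 61.64, "Proposed": 93.04},
    "RealText": {"PSENet": 79.36, "PAN": 54.80, "DB": 35.48, "Proposed": 97.47},
}


def min_gain(scores, ours="Proposed"):
    """Gain in percentage points over the strongest baseline."""
    best_baseline = max(value for name, value in scores.items() if name != ours)
    return round(scores[ours] - best_baseline, 2)


gains = {dataset: min_gain(scores) for dataset, scores in f1_iou75.items()}
print(gains)  # {'BDD-SynText': 21.36, 'RealText': 18.11}
```

On both datasets the strongest baseline at IoU75 with ResNet50 is PSENet, so the minimum gain is measured against it.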

Fig.8 Visualization results of different methods

Fig.9 F1-Score of each method under different aliasing levels

Tab.5 Results of ablation experiment

Experiment ID	BM	SDPT	MTT	P ↑ /%	R ↑ /%	F ↑ /%
1	×	×	×	87.12	88.43	87.77
2	√	×	×	90.53	91.28	90.90
3	×	√	×	90.57	90.50	90.53
4	×	×	√	90.52	91.38	90.95
5	√	√	×	89.46	89.31	89.38
6	√	×	√	89.51	90.36	89.93
7	×	√	√	91.34	91.26	91.30
8	√	√	√	92.27	92.19	92.23
