Single direction projected Transformer method for aliasing text detection

doi:10.11772/j.issn.1001-9081.2021101749

Abstract

To address the performance degradation of segmentation-based text detection methods in aliasing text scenes, a Single Direction Projected Transformer (SDPT) was proposed for aliasing text detection. Firstly, multi-scale features were extracted and fused by using a deep Residual Network (ResNet) and a Feature Pyramid Network (FPN). Then, the feature map was projected into a vector sequence by horizontal projection and fed into a Transformer module to model the relationships between lines of text. Finally, multiple objectives were jointly optimized. Extensive experiments were conducted on the synthetic dataset BDD-SynText and the real dataset RealText. The results show that the proposed SDPT achieves the best performance on text with a high aliasing level: compared with state-of-the-art text detection algorithms such as Progressive Scale Expansion Network (PSENet) under the same backbone network (ResNet50), it improves F1-Score (IoU75) by at least 21.36 percentage points on BDD-SynText and at least 18.11 percentage points on RealText, verifying the important role of the proposed method in improving aliasing text detection performance.
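The projection step in the abstract can be illustrated with a minimal sketch. It assumes that "horizontal projection" means collapsing the width axis of a C×H×W feature map so that each image row becomes one C-dimensional token; averaging is used here purely as an assumption, since the abstract does not name the reduction operator. The function name `horizontal_projection` and the toy data are hypothetical, and the Transformer stage that would consume the sequence is omitted.

```python
from typing import List

# Feature map indexed as [channel][row][column]; a stand-in for a fused FPN output.
FeatureMap = List[List[List[float]]]


def horizontal_projection(feature_map: FeatureMap) -> List[List[float]]:
    """Collapse the width axis by averaging, giving one C-dim vector per row.

    Assumption: mean pooling along the width; the paper's operator may differ.
    """
    channels = len(feature_map)
    height = len(feature_map[0])
    sequence = []
    for y in range(height):
        # One token per image row: the mean activation of each channel on that row.
        token = [
            sum(feature_map[c][y]) / len(feature_map[c][y])
            for c in range(channels)
        ]
        sequence.append(token)
    return sequence


# Toy feature map with 2 channels, 3 rows and 4 columns.
fmap = [
    [[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], [2.0, 2.0, 4.0, 4.0]],
    [[0.0, 2.0, 0.0, 2.0], [1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]],
]
tokens = horizontal_projection(fmap)
print(len(tokens), "tokens of dimension", len(tokens[0]))
```

The resulting sequence has one entry per row, so a sequence model placed after it can relate rows (and hence text lines) to each other along the vertical axis.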

Key words: computer vision, deep learning, scene text detection, aliasing text, projection, Transformer algorithm

CLC Number: TP391.4

Zhida FENG, Li CHEN. Single direction projected Transformer method for aliasing text detection[J]. Journal of Computer Applications, 2022, 42(12): 3686-3691.

Figures/Tables 14

Fig.1 Comparison of the proposed method and classical segmentation-based methods

Fig.2 Three situations of text aliasing

Fig.3 Example of aliasing level calculation

Fig.4 Overview flow of the proposed method

Fig.5 Multi-scale feature fusion

Fig.6 Transformer Encoder

Fig.7 Image examples of BDD-SynText

Tab.1 Comparison results of IoU50 benchmark on BDD-SynText dataset

Method	Backbone	P $↑$ /%	R $↑$ /%	F $↑$ /%
PSENet	ResNet18	88.04	73.50	79.10
PAN	ResNet18	76.58	68.31	71.11
DB	ResNet18	93.94	88.01	90.53
Proposed method	ResNet18	98.54	98.48	98.50
PSENet	ResNet50	89.89	75.23	80.91
PAN	ResNet50	74.35	67.07	69.55
DB	ResNet50	95.24	90.22	92.39
Proposed method	ResNet50	98.43	98.35	98.39
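The IoU50 and IoU75 benchmarks differ only in how tightly a detection must overlap its ground truth to count as correct. The sketch below assumes that these labels denote intersection-over-union thresholds of 0.5 and 0.75, and it uses axis-aligned boxes for simplicity; the actual evaluation may use polygons or rotated boxes. The names `iou` and `is_match` are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred: Box, gt: Box, threshold: float) -> bool:
    """A prediction counts as correct when its IoU reaches the threshold."""
    return iou(pred, gt) >= threshold


# A text line 20 px tall, detected with a 4 px vertical offset.
gt_box = (0.0, 0.0, 100.0, 20.0)
pred_box = (0.0, 4.0, 100.0, 24.0)
print(round(iou(pred_box, gt_box), 3))
print(is_match(pred_box, gt_box, 0.50), is_match(pred_box, gt_box, 0.75))
```

In this toy case a small vertical shift of a thin text line is accepted at the 0.5 threshold but rejected at 0.75, which is why the stricter benchmark is more sensitive to imprecise line boundaries.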

Tab.2 Comparison results of IoU75 benchmark on BDD-SynText dataset

Method	Backbone	P $↑$ /%	R $↑$ /%	F $↑$ /%
PSENet	ResNet18	74.97	64.15	68.43
PAN	ResNet18	78.21	75.58	76.50
DB	ResNet18	65.63	61.62	63.33
Proposed method	ResNet18	92.27	92.19	92.23
PSENet	ResNet50	78.39	67.23	71.68
PAN	ResNet50	64.35	59.47	61.18
DB	ResNet50	63.44	60.25	61.64
Proposed method	ResNet50	93.07	93.00	93.04

Tab.3 Comparison results of IoU50 benchmark on RealText dataset

Method	Backbone	P $↑$ /%	R $↑$ /%	F $↑$ /%
PSENet	ResNet18	98.60	90.08	94.15
PAN	ResNet18	54.61	51.04	52.77
DB	ResNet18	41.49	60.23	49.14
Proposed method	ResNet18	97.25	97.33	97.29
PSENet	ResNet50	99.50	91.50	95.33
PAN	ResNet50	79.42	59.12	67.78
DB	ResNet50	45.92	72.80	56.32
Proposed method	ResNet50	97.42	97.51	97.47
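As a quick consistency check on the P, R and F columns, the sketch below computes F as the harmonic mean of precision and recall. This is the standard F1 definition and is assumed here rather than taken from the paper's text; the helper name `f1_score` is hypothetical. The three rows used are copied from the RealText IoU50 table.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F) for three rows of the RealText IoU50 table.
rows = {
    "PSENet / ResNet18": (98.60, 90.08, 94.15),
    "PSENet / ResNet50": (99.50, 91.50, 95.33),
    "PAN / ResNet50": (79.42, 59.12, 67.78),
}
for name, (p, r, reported_f) in rows.items():
    print(name, round(f1_score(p, r), 2), "reported:", reported_f)
```

For these rows the recomputed values agree with the reported F to within rounding. Rows in other tables need not agree exactly if F was averaged per image rather than derived from the aggregate P and R, which is not stated here.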

Tab.4 Comparison results of IoU75 benchmark on RealText dataset

Method	Backbone	P $↑$ /%	R $↑$ /%	F $↑$ /%
PSENet	ResNet18	84.34	77.05	80.53
PAN	ResNet18	42.34	39.57	40.91
DB	ResNet18	24.98	36.26	29.58
Proposed method	ResNet18	97.25	97.33	97.29
PSENet	ResNet50	82.83	76.17	79.36
PAN	ResNet50	64.20	47.80	54.80
DB	ResNet50	28.93	45.86	35.48
Proposed method	ResNet50	97.42	97.51	97.47
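The "at least 21.36" and "at least 18.11" percentage-point gains quoted in the abstract can be traced back to the ResNet50 rows of the two IoU75 tables. The sketch below copies those F values and takes the margin of the proposed method over the strongest baseline; the dictionary layout and the function name `margin_over_best_baseline` are illustrative choices, not part of the paper.

```python
from typing import Dict

# F values (%) at IoU75 with the ResNet50 backbone, copied from Tab.2 and Tab.4.
f_iou75_resnet50: Dict[str, Dict[str, float]] = {
    "BDD-SynText": {"PSENet": 71.68, "PAN": 61.18, "DB": 61.64, "Proposed": 93.04},
    "RealText": {"PSENet": 79.36, "PAN": 54.80, "DB": 35.48, "Proposed": 97.47},
}


def margin_over_best_baseline(scores: Dict[str, float], ours: str = "Proposed") -> float:
    """Gap in percentage points between our F and the best competing F."""
    best_baseline = max(v for name, v in scores.items() if name != ours)
    return round(scores[ours] - best_baseline, 2)


margins = {
    dataset: margin_over_best_baseline(scores)
    for dataset, scores in f_iou75_resnet50.items()
}
print(margins)
```

On both datasets the strongest ResNet50 baseline at IoU75 is PSENet, so the gap to PSENet is the minimum improvement over all compared methods.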

Fig.8 Visualization results of different methods

Fig.9 F1-Score of each method under different aliasing levels

Tab.5 Results of ablation experiment

Experiment ID	BM	SDPT	MTT	P $↑$ /%	R $↑$ /%	F $↑$ /%
1	×	×	×	87.12	88.43	87.77
2	√	×	×	90.53	91.28	90.90
3	×	√	×	90.57	90.50	90.53
4	×	×	√	90.52	91.38	90.95
5	√	√	×	89.46	89.31	89.38
6	√	×	√	89.51	90.36	89.93
7	×	√	√	91.34	91.26	91.30
8	√	√	√	92.27	92.19	92.23
