
PGNet: Real-time Arbitrarily-Shaped Text Spotting with Point Gathering Network

PRELIMINARY VERSION: DO NOT CITE. The AAAI Digital Library will contain the published version some time after the conference.

Pengfei Wang,1 Chengquan Zhang,2 Fei Qi,1 Shanshan Liu,2 Xiaoqiang Zhang,2 Pengyuan Lyu,2 Junyu Han,2 Jingtuo Liu,2 Errui Ding,2 Guangming Shi1
1 School of Artificial Intelligence, Xidian University
2 Department of Computer Vision Technology, Baidu Inc.

Abstract

The reading of arbitrarily-shaped text has received increasing research attention, but existing text spotters are mostly built on two-stage frameworks or character-based methods, which suffer from either Non-Maximum Suppression (NMS) and Region-of-Interest (RoI) operations or character-level annotations. In this paper, to address the above problems, we propose a novel fully convolutional Point Gathering Network (PGNet) for reading arbitrarily-shaped text in real-time.




PGNet is a single-shot text spotter, where the pixel-level character classification map is learned with the proposed PG-CTC loss, avoiding the usage of character-level annotations. With the PG-CTC decoder, we gather high-level character classification vectors from two-dimensional space and decode them into text symbols without NMS and RoI operations involved, which guarantees high efficiency. Additionally, by reasoning about the relations between each character and its neighbors, a graph refinement module (GRM) is proposed to optimize the coarse recognition and further improve the end-to-end performance. Experiments demonstrate that the proposed method achieves state-of-the-art or competitive accuracy while significantly improving the running speed. In particular, on Total-Text it runs at real-time speed, surpassing the previous spotters by a large margin.

Figure 1: Model speed vs. recognition accuracy on Total-Text. Our PGNet-E is at least twice as fast as the most recent state-of-the-art method ABCNet (Liu et al. 2020), with competitive recognition accuracy. Complete results are in Table 3.

1 Introduction

Recently, scene text reading has attracted extensive attention in both academia and industry for its numerous applications, such as scene understanding, image retrieval, augmented reality translation (Wu et al. 2019), and robot navigation. Thanks to the surge of deep neural networks, significant progress has been made in separable detection and recognition solutions (Wu and Natarajan 2017; Long et al. 2018; Wang et al. 2019b,a; Zhan and Lu 2019; Shi et al. 2018; Yu et al. 2020; Wan et al. 2019), as well as in end-to-end text spotting methods. However, existing end-to-end models (Sun et al. 2018; Liu et al. 2018; Feng et al. 2019) are mostly built on two-stage frameworks or character-based methods (Xing et al. 2019; Lyu et al. 2018) with a complex pipeline, which are inefficient for real-time applications. In this paper, we try to investigate a real-time text spotter for arbitrarily-shaped text.

Reading arbitrarily-shaped scene text is a challenging task, as compared in Fig. 2, and the most recent works may suffer from the following disadvantages: (1) The pipelines of two-stage methods (Sun et al. 2018; Lyu et al. 2018; Feng et al. 2019; Liu et al. 2020) are inefficient, as they may involve time-consuming Non-Maximum Suppression (NMS) and Region-of-Interest (RoI) operations. Especially for arbitrarily-shaped text spotters, specific RoI transformation operations, such as RoISlide (Feng et al. 2019) or BezierAlign (Liu et al. 2020), bring non-negligible computational overhead. (2) In Mask TextSpotter (Lyu et al. 2018) and CharNet (Xing et al. 2019), character-level annotations are required for training, which is too expensive to afford. Though CharNet can be trained in a weakly supervised manner using character-level annotations from synthetic datasets, freely synthesized data cannot fully replace real data in practice. (3) Recognition of text in non-traditional reading directions fails under pre-defined rules. For example, TextDragon (Feng et al. 2019) and Mask TextSpotter (Lyu et al. 2018) make the strong assumption that the reading direction of a text region is either from left to right or from top to bottom, which precludes correct recognition of more challenging text.

(Equal contribution. Fei Qi is the corresponding author. Copyright (c) 2021, Association for the Advancement of Artificial Intelligence. All rights reserved.)

Figure 2: Overview of some end-to-end scene text spotting methods that are most relevant to ours: (a) Li et al., ICCV 2017; (b) Liu et al., ICCV 2017; (c) Liu et al., CVPR 2020; (d) Liao et al., ECCV 2018; (e) Xing et al., ICCV 2019; (f) Feng et al., ICCV 2019; (g) ours. The blue and green boxes represent their detection and recognition results. Inside the GT (ground-truth) box, 'W' and 'C' denote word-level and character-level annotation. 'H', 'Q', and 'A' denote that the method can detect horizontal, quadrilateral, and arbitrarily-shaped text, respectively. Our method is free from character-level annotations, NMS, and RoI operations.

In this paper, we propose a novel framework for reading text at real-time speed with a point gathering operation, namely PGNet. PGNet is a single-shot text spotter based on multi-task learning.

The architecture of PGNet is shown in Fig. 3. We employ an FCN (Milletari, Navab, and Ahmadi 2016) model to learn various information of text regions simultaneously, including the text center line (TCL), text border offset (TBO), text direction offset (TDO), and text character classification (TCC) maps. The pixel-level character classification map is trained with the proposed Point Gathering CTC (PG-CTC) loss, making it free from character-level annotations. In post-processing, we extract the center point sequence of each text instance in reading order using the TCL and TDO maps, and the detection results can be obtained with the corresponding boundary offset information from the TBO map. Using the PG-CTC decoder, we serialize the high-level two-dimensional TCC map into character classification probability vector sequences, which can be further decoded into the recognition results. The details are discussed below. As depicted in Fig. 2 and Fig. 1, our pipeline is simple yet efficient, and experiments on public benchmarks prove that PGNet achieves better or competitive end-to-end performance with excellent running speed.

Moreover, inspired by SRN (Yu et al. 2020) and GTC (Hu et al. 2020), we propose a graph refinement module (GRM) to perform secondary reasoning and further improve the end-to-end performance. The points in a text sequence are formulated as nodes in a graph, where the representation of each node is enhanced with semantic-context and visual-context information from its neighbors, so the character classification results should be more accurate.

The contributions of this paper are three-fold:

- We propose a simple yet powerful arbitrarily-shaped text spotter (PGNet), which is free from character-level annotations, NMS, and RoI operations, and achieves better or competitive end-to-end performance with excellent running speed;
- We introduce a mechanism to restore the reading order of characters in each text instance, making our method able to correctly recognize text in more challenging situations and non-traditional reading directions;
- We propose an efficient graph refinement module (GRM) to improve the CTC recognition.

2 Related Work

In this section, we review some representative scene text spotters, as well as some recent progress in graph neural networks. A comprehensive review of recent scene text spotters can be found in (Ye and Doermann 2015; Zhu, Yao, and Bai 2016; Baek et al. 2019).

Scene Text Spotting. Inspired by generic object detection methods (Liu et al. 2016; Ren et al. 2015; Redmon et al. 2016) and segmentation methods (He et al. 2017; Milletari, Navab, and Ahmadi 2016), text spotting methods have developed from spotting regular scene text to spotting arbitrarily-shaped scene text. Lee and Osindero (2016) proposed the first successful end-to-end text recognition model, which only supports horizontal text and requires relatively complex training procedures. To address the multi-orientation problem of text, Busta et al. (2017) utilize YOLO (Redmon et al. 2016) to generate rotational proposals and train RoI-sampled features with CTC loss. Inspired by Faster R-CNN (Ren et al. 2015), TextNet (Sun et al. 2018) generates text proposals in quadrangles and encodes the aligned RoI features into context information with a simple recurrent neural network to generate the text sequences; since these features contain some background information, it may suffer when reading curved text. For the spotting of arbitrarily-shaped scene text, Mask TextSpotter (Lyu et al. 2018) detects and recognizes text instances of arbitrary shapes by segmenting the text regions and character regions. However, character-level annotations are required for training, which is too expensive to afford. Considering the arbitrarily-shaped region of text as a series of quadrangles, TextDragon (Feng et al. 2019) extracts the components of text features through RoISlide.

Figure 3: The pipeline of PGNet: 1) extract features from an input image, and learn the TCL, TBO, TDO, and TCC maps as a multi-task problem; 2) the detection and recognition of each text instance are achieved in a single shot by polygon restoration and the PG-CTC decoding mechanism applied to the center point sequence of each text region (e.g., the gathered sequences "RR-A-N-DD--Y-SS" and "D-OO-N--UU-T-S" decode to "RANDYS" and "DONUTS"). The TCC map is trained without character-level annotations.
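The polygon-restoration step of the detection branch described above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the TBO channel layout `[dx_up, dy_up, dx_down, dy_down]` and the map shapes are assumptions made for the sake of the example.

```python
import numpy as np

def restore_polygon(center_points, tbo_map):
    """Restore a text polygon from center points on the text center line
    (TCL) plus border offsets read from the TBO map.

    Assumed layout: tbo_map has shape (H, W, 4) holding
    [dx_up, dy_up, dx_down, dy_down] offsets to the upper/lower borders.
    The polygon is the upper border followed by the reversed lower border.
    """
    upper, lower = [], []
    for x, y in center_points:  # points are given in reading order
        dxu, dyu, dxd, dyd = tbo_map[y, x]
        upper.append((x + dxu, y + dyu))
        lower.append((x + dxd, y + dyd))
    return upper + lower[::-1]  # closed polygon in boundary order
```

For a horizontal three-point center line with unit vertical offsets, this yields a six-point band around the line; curved center lines yield curved polygons the same way.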

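On the recognition side, the point-gathering and CTC decoding described above can be sketched as follows. This is a hedged illustration under assumed shapes (a `tcc_map` of shape `(H, W, C)` with class 0 as the CTC blank); the paper's decoder may differ in detail.

```python
import numpy as np

def pg_ctc_decode(tcc_map, center_points, charset, blank=0):
    """Gather one character-probability vector per center point from the
    2-D TCC map, take the argmax per point, then apply the standard CTC
    collapse rule: merge consecutive duplicates, then drop blanks."""
    gathered = np.stack([tcc_map[y, x] for x, y in center_points])  # (N, C)
    ids = gathered.argmax(axis=1)
    out, prev = [], None
    for i in ids:
        if i != prev and i != blank:
            out.append(charset[i])
        prev = i
    return "".join(out)
```

Under this rule a gathered per-point sequence such as "RR-A-N-DD--Y-SS" (with "-" as blank) collapses to "RANDYS", matching the example in Fig. 3, and no NMS or RoI operation is involved at any point.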

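The graph reasoning in the GRM can be approximated by a round of neighbor aggregation over the point sequence. The chain-graph adjacency, the weight matrices, and the single-step update below are illustrative assumptions for intuition only, not the module's actual architecture or parameters.

```python
import numpy as np

def chain_adjacency(n):
    """Adjacency for a text point sequence: each node (center point) is
    linked to its previous and next neighbors, forming a chain graph."""
    adj = np.zeros((n, n))
    for i in range(n - 1):
        adj[i, i + 1] = adj[i + 1, i] = 1.0
    return adj

def refine_features(node_feats, adj, w_self, w_neigh):
    """One message-passing step: mix each node's feature with the mean
    feature of its neighbors, in the spirit of the GRM's enhancement of
    node representations with context from neighboring characters."""
    deg = adj.sum(axis=1, keepdims=True).clip(min=1.0)
    neigh = (adj @ node_feats) / deg          # mean over neighbors
    return np.tanh(node_feats @ w_self + neigh @ w_neigh)
```

In a full module, the refined per-point features would be re-classified to produce the corrected character sequence that replaces the coarse CTC recognition.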