Real-time Scene Text Detection with Differentiable ...

Real-time Scene Text Detection with Differentiable Binarization

Minghui Liao1, Zhaoyi Wan2, Cong Yao2, Kai Chen3,4, Xiang Bai1
1 Huazhong University of Science and Technology, 2 Megvii, 3 Shanghai Jiao Tong University, 4 Onlyou

Segmentation-based methods are quite popular in scene text detection, as the segmentation results can more accurately describe scene text of various shapes such as curved text. However, the post-processing of binarization is essential for segmentation-based detection, which converts probability maps produced by a segmentation method into bounding boxes/regions of text. In this paper, we propose a module named Differentiable Binarization (DB), which can perform the binarization process in a segmentation network. Optimized along with a DB module, a segmentation network can adaptively set the thresholds for binarization, which not only simplifies the post-processing but also enhances the performance of text detection.

Based on a simple segmentation network, we validate the performance improvements of DB on five benchmark datasets, on which it consistently achieves state-of-the-art results in terms of both detection accuracy and speed. In particular, with a light-weight backbone, the performance improvements by DB are significant, so that we can look for an ideal tradeoff between detection accuracy and efficiency. Specifically, with a backbone of ResNet-18, our detector achieves an F-measure of [...], running at 62 FPS, on the MSRA-TD500 dataset. Code is available at: [...]

In recent years, reading text in scene images has become an active research area, due to its wide practical applications such as image/video understanding, visual search, automatic driving, and blind [...].

As a key component of scene text reading, scene text detection, which aims to localize the bounding box or region of each text instance, is still a challenging task, since scene text often comes in various scales and shapes, including horizontal, multi-oriented and curved text.

Authors contribute equally. Corresponding author.
Copyright © 2020, Association for the Advancement of Artificial Intelligence ( ). All rights reserved.

[Figure 1: plot of accuracy (%) against speed (FPS) for DB (ours; DB-ResNet-50 and DB-ResNet-18), CRAFT (2019), TextSnake (2018), Corner (2018), RRD (2018), PixelLink (2018), and SegLink (2017).]
Figure 1: The comparisons of several recent scene text detection methods on the MSRA-TD500 dataset, in terms of both accuracy and speed. Our method achieves the ideal tradeoff between effectiveness and efficiency.

Segmentation-based scene text detection has attracted a lot of attention recently, as it can describe text of various shapes, benefiting from its prediction results at the pixel level. However, most segmentation-based methods require complex post-processing for grouping the pixel-level prediction results into detected text instances, resulting in a considerable time cost in the inference procedure.

Take two recent state-of-the-art methods for scene text detection as examples: PSENet (Wang et al. 2019a) proposed the post-processing of progressive scale expansion for improving the detection accuracies; pixel embedding in (Tian et al. 2019) is used for clustering the pixels based on the segmentation results, which has to calculate the feature distances among pixels.

Existing detection methods use a similar post-processing pipeline, as shown in Fig. 2 (following the blue arrows): firstly, they set a fixed threshold for converting the probability map produced by a segmentation network into a binary image; then, some heuristic techniques like pixel clustering are used for grouping pixels into text instances. Alternatively, our pipeline (following the red arrows in Fig. 2) aims to insert the binarization operation into a segmentation network for joint optimization.
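The two-step traditional pipeline described above can be sketched as follows. This is a minimal illustration, not any cited method's implementation: the threshold of 0.5, the function names, and the use of 4-connected flood fill as the grouping heuristic are all assumptions made for the example (the cited methods use more elaborate grouping such as pixel clustering).

```python
from collections import deque

import numpy as np


def binarize_fixed(prob_map, threshold=0.5):
    """Step 1: convert a probability map to a binary image with one fixed threshold."""
    return prob_map >= threshold


def group_pixels(binary):
    """Step 2 (stand-in heuristic): group foreground pixels into instances
    using 4-connected flood fill. Returns a label map and the instance count."""
    h, w = binary.shape
    labels = np.zeros((h, w), dtype=np.int32)
    count = 0
    for y in range(h):
        for x in range(w):
            if not binary[y, x] or labels[y, x]:
                continue  # background, or already assigned to an instance
            count += 1
            labels[y, x] = count
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    inside = 0 <= ny < h and 0 <= nx < w
                    if inside and binary[ny, nx] and not labels[ny, nx]:
                        labels[ny, nx] = count
                        queue.append((ny, nx))
    return labels, count


# Toy probability map with two separate high-probability regions.
prob = np.array([
    [0.9, 0.8, 0.1, 0.2],
    [0.7, 0.1, 0.1, 0.9],
    [0.1, 0.1, 0.2, 0.8],
])
labels, n = group_pixels(binarize_fixed(prob, 0.5))
print(n)
print(labels)
```

Neither step has a useful gradient with respect to the probability map, which is why this pipeline runs only after the network rather than being trained with it.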

In this manner, the threshold value at every place of an image can be adaptively predicted, which can fully distinguish the pixels of the foreground from those of the background. However, since the standard binarization function is not differentiable, we instead present an approximate function for binarization called Differentiable Binarization (DB), which is fully differentiable when training it along with a segmentation network.

[ ] 3 Dec 2019

[Figure 2: flow diagram connecting image, segmentation map, binarization map, threshold map, and detection results.]
Figure 2: Traditional pipeline (blue flow) and our pipeline (red flow). Dashed arrows are the inference only operators; solid arrows indicate differentiable operators in both training and inference.

The major contribution in this paper is the proposed DB module that is differentiable, which makes the process of binarization end-to-end trainable in a CNN. By combining a simple network for semantic segmentation and the proposed DB module, we propose a robust and fast scene text detector.
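To make the contrast between standard and differentiable binarization concrete, the sketch below compares a hard step with a steep sigmoid of the difference between the probability and the threshold. The specific form 1 / (1 + exp(-k (P - T))) and the amplifying factor k = 50 are not stated in this section and are assumed here for illustration; the function names are hypothetical.

```python
import numpy as np


def standard_binarization(p, t=0.5):
    """Hard step: 1 where p >= t, else 0. Its gradient is zero almost everywhere."""
    return (p >= t).astype(np.float64)


def differentiable_binarization(p, t, k=50.0):
    """Assumed approximate step: a steep sigmoid of (p - t) with factor k."""
    return 1.0 / (1.0 + np.exp(-k * (p - t)))


def db_grad_wrt_p(p, t, k=50.0):
    """Analytic derivative of the sigmoid form with respect to p: k * b * (1 - b)."""
    b = differentiable_binarization(p, t, k)
    return k * b * (1.0 - b)


p = np.array([0.10, 0.45, 0.50, 0.55, 0.90])
t = np.full_like(p, 0.5)  # a constant threshold map, for comparison only

hard = standard_binarization(p, 0.5)
soft = differentiable_binarization(p, t)
grad = db_grad_wrt_p(p, t)

print(hard)
print(np.round(soft, 4))
print(np.round(grad, 4))
```

Far from the threshold the soft output is nearly identical to the hard one, while near the threshold the gradient is large, so a threshold map learned jointly with the probability map receives a training signal exactly where pixels are ambiguous.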

Observed from the performance evaluation of using the DB module, we discover that our detector has several prominent advantages over the previous state-of-the-art segmentation-based approaches:

1. Our method achieves consistently better performances on five benchmark datasets of scene text, including horizontal, multi-oriented and curved text.
2. Our method performs much faster than the previous leading methods, as DB can provide a highly robust binarization map, significantly simplifying the post-processing.
3. DB works quite well when using a light-weight backbone, which significantly enhances the detection performance with the backbone of [...].
4. As DB can be removed in the inference stage without sacrificing the performance, there is no extra memory/time cost for inference.

Related Work

Recent scene text detection methods can be roughly classified into two categories: regression-based methods and segmentation-based methods.

Regression-based methods are a series of models which directly regress the bounding boxes of the text instances. TextBoxes (Liao et al. 2017) modified the anchors and the scale of the convolutional kernels based on SSD (Liu et al. 2016) for text detection. TextBoxes++ (Liao, Shi, and Bai 2018) and DMPNet (Liu and Jin 2017) applied quadrilateral regression to detect multi-oriented text. SSTD (He et al.) proposed an attention mechanism to roughly identify text regions. RRD (Liao et al. 2018) decoupled the classification and regression by using rotation-invariant features for classification and rotation-sensitive features for regression, for better effect on multi-oriented and long text instances. EAST (Zhou et al. 2017) and DeepReg (He et al.) are anchor-free methods, which applied pixel-level regression for multi-oriented text instances. SegLink (Shi, Bai, and Belongie 2017) regressed the segment bounding boxes and predicted their links, to deal with long text instances.

DeRPN (Xie et al. 2019b) proposed a dimension-decomposition region proposal network to handle the scale problem in scene text detection. Regression-based methods usually enjoy simple post-processing algorithms (e.g., non-maximum suppression). However, most of them are limited in representing accurate bounding boxes for irregular shapes, such as curved shapes.

Segmentation-based methods usually combine pixel-level prediction and post-processing algorithms to get the bounding boxes. (Zhang et al. 2016) detected multi-oriented text by semantic segmentation and MSER-based algorithms. Text border is used in (Xue, Lu, and Zhan 2018) to split the text instances. Mask TextSpotter (Lyu et al. 2018a; Liao et al. 2019) detected arbitrary-shape text instances in an instance segmentation manner based on Mask R-CNN. PSENet (Wang et al. 2019a) proposed progressive scale expansion by segmenting the text instances with different scale kernels.

Pixel embedding is proposed in (Tian et al. 2019) to cluster the pixels from the segmentation results. PSENet (Wang et al. 2019a) and SAE (Tian et al. 2019) proposed new post-processing algorithms for the segmentation results, resulting in lower inference speed. Instead, our method focuses on improving the segmentation results by including the binarization process in the training period, without loss of inference speed.

Fast scene text detection methods focus on both the accuracy and the inference speed. TextBoxes (Liao et al. 2017), TextBoxes++ (Liao, Shi, and Bai 2018), SegLink (Shi, Bai, and Belongie 2017), and RRD (Liao et al. 2018) achieved fast text detection by following the detection architecture of SSD (Liu et al. 2016). EAST (Zhou et al. 2017) proposed to apply PVANet (Kim et al. 2016) to improve its speed. Most of them cannot deal with text instances of irregular shapes, such as curved shapes.

Compared to the previous fast scene text detectors, our method not only runs faster but also can detect text instances of arbitrary shapes.

The architecture of our proposed method is shown in Fig. 3. Firstly, the input image is fed into a feature-pyramid backbone. Secondly, the pyramid features are up-sampled to the same scale and cascaded to produce feature F. Then, feature F is used to predict both the probability map (P) and the threshold map (T). After that, the approximate binary map (B̂) is calculated from P and T. In the training period, the supervision is applied on the probability map, the threshold map, and the approximate binary map, where the probability map and the approximate binary map share the same supervision. In the inference period, the bounding boxes can be obtained easily from the approximate binary map or the probability map by box formulation.

Binarization

Given a probability map $P \in \mathbb{R}^{H \times W}$ produced by a segmentation network, where H and W indicate the height and width of the map, it is essential to convert it to a binary map $P \in \mathbb{R}^{H \times W}$, where pixels with [...]

[Figure 3 diagram: a feature-pyramid backbone with feature maps at 1/2, 1/4, 1/8, 1/16, and 1/32 scale, merged top-down by element-wise sums with up-sampling by 2; 3×3 convolutions with up-sampling by 2, 4, and 8; concatenation at 1/4 scale; two pred heads producing the probability map and the threshold map; DB producing the approximate binary map; box formation. Legend: "+" is element-wise sum, "up N" is up-sampling with ratio N, "conv" is a 3×3 convolution.]
Figure 3: Architecture of our proposed method, where pred consists of a 3 × 3 convolutional operator and two de-convolutional operators with stride 2.
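The data flow of the architecture (pyramid features up-sampled to a common scale, concatenated into a fused feature, then used to predict a probability map and a threshold map that are combined into an approximate binary map) can be sketched with NumPy shapes alone. Everything below is a simplified stand-in rather than the network itself: the channel counts, the random weights, the `predict_head` helper (a 1×1 projection plus sigmoid instead of the convolution and de-convolutions of pred, so outputs stay at 1/4 scale), and the steep-sigmoid form with k = 50 for the approximate binary map are assumptions for illustration.

```python
import numpy as np


def upsample_nearest(feat, ratio):
    """Up-sample a (C, H, W) feature map by an integer ratio (nearest neighbour)."""
    return feat.repeat(ratio, axis=1).repeat(ratio, axis=2)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def fuse_pyramid(features):
    """Up-sample every level to the finest level's size and concatenate on channels.
    `features` is a list of (C, H_i, W_i) arrays, finest level first."""
    target_h = features[0].shape[1]
    ups = [upsample_nearest(f, target_h // f.shape[1]) for f in features]
    return np.concatenate(ups, axis=0)


def predict_head(fused, weights):
    """Hypothetical stand-in for pred: per-pixel linear projection plus sigmoid."""
    return sigmoid(np.tensordot(weights, fused, axes=([0], [0])))


rng = np.random.default_rng(0)

# Assumed 64x64 input: levels at 1/4, 1/8, 1/16, 1/32 scale, 8 channels each.
features = [rng.standard_normal((8, s, s)) for s in (16, 8, 4, 2)]

F = fuse_pyramid(features)                    # fused feature, shape (32, 16, 16)
P = predict_head(F, rng.standard_normal(32))  # probability map, values in (0, 1)
T = predict_head(F, rng.standard_normal(32))  # threshold map, values in (0, 1)
B_hat = sigmoid(50.0 * (P - T))               # assumed approximate binary map

print(F.shape, P.shape, T.shape, B_hat.shape)
```

Because P and T come from the same fused feature and B̂ is a smooth function of both, a loss on B̂ would propagate to both heads, which is the joint optimization the pipeline relies on.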
