Transcription of ERFNet: Efficient Residual Factorized ConvNet for Real …
ERFNet: Efficient Residual Factorized ConvNet for Real-time Semantic Segmentation
Eduardo Romera1, José M. Alvarez2, Luis M. Bergasa1 and Roberto Arroyo1

Abstract: Semantic segmentation is a challenging task that addresses most of the perception needs of Intelligent Vehicles (IV) in a unified way. Deep Neural Networks excel at this task, as they can be trained end-to-end to accurately classify multiple object categories in an image at the pixel level. However, a good trade-off between high quality and computational resources is not yet present in state-of-the-art semantic segmentation approaches, limiting their application in real vehicles.
In this paper, we propose a deep architecture that is able to run in real-time while providing accurate semantic segmentation. The core of our architecture is a novel layer that uses Residual connections and Factorized convolutions in order to remain efficient while retaining remarkable accuracy. Our approach is able to run at over 83 FPS in a single Titan X, and 7 FPS in a Jetson TX1 (embedded GPU). A comprehensive set of experiments on the publicly available Cityscapes dataset demonstrates that our system achieves an accuracy that is similar to the state of the art, while being orders of magnitude faster to compute than other architectures that achieve top precision.
The resulting trade-off makes our model an ideal approach for scene understanding in IV applications. The code is publicly available at:

Index Terms: Intelligent Vehicles, Scene understanding, Real-time, Semantic segmentation, Deep learning, Residual layers

INTRODUCTION
The perception tasks of Intelligent Vehicles (IV) pose important challenges due to the high complexity of the environments in which they are required to operate (e.g., streets). While most systems rely on sensor fusion to understand as much as possible of their surrounding scene, cameras have gained significant importance in the community due to the remarkable advances in the computer vision field. Images are a rich multi-dimensional signal that is cheap to capture, but that requires complex algorithms to process. Vision-based approaches were initially aimed at developing specific techniques for detecting traffic elements such as the road pavement, pedestrians, cars, signs or traffic lights independently [1].
However, recent advances in deep learning have made it possible to unify all of these classification problems into one single task: semantic segmentation. The task of semantic segmentation aims at labeling categories at the pixel level of an image and has direct applications in the computer vision field. It is a challenging task because it requires combining dense pixel-level accuracy with multi-scale contextual reasoning [2]. Convolutional Neural Networks (ConvNets), initially designed for classification tasks [3][4] and recently adapted to segmentation [5], have demonstrated impressive capabilities at solving these complex tasks. They are able to achieve end-to-end full-image segmentation with an accuracy that outperforms any traditional method. However, a good trade-off between high quality and computational resources is not yet present in state-of-the-art segmentation approaches.

1 Eduardo Romera, the Department of Electronics, University of Alcalá (UAH), Alcalá

Fig. 1. Diagram that depicts the proposed segmentation system (ERFNet) for an example input image and its corresponding output (C=19 classes). The depicted volumes correspond to the feature maps produced by each layer. All spatial resolution values are with regard to the example input (1024x512), but the network can operate with arbitrary image sizes.

The Residual layers proposed in [6] have set a new trend in ConvNets design.
Their reformulation of the convolutional layers to avoid the degradation problem of deep architectures has allowed recent works to achieve very high accuracies with networks that stack large amounts of layers. This strategy has been commonly adopted in new architectures that obtain top accuracy at both image classification challenges [6][7] and semantic segmentation challenges [8][9][10]. Despite these achievements, we consider that this design strategy is not an effective way to obtain a good trade-off between accuracy and efficiency. Considering a reasonable amount of layers, enlarging the depth with more convolutions achieves only small gains in accuracy while significantly increasing the required resources.

Computational resources are a key limitation in IV applications.
Algorithms are not only required to operate reliably, but they are also required to operate fast (real-time), fit in embedded devices due to space constraints (compactness), and have low power consumption so as to affect the vehicle's autonomy as little as possible. Regarding ConvNets, all this translates into the GPU resources that are required to process the network parameters. With this in mind, some works have aimed at developing efficient architectures that can run in real-time [11][12]. However, these approaches usually focus on obtaining this efficiency through an aggressive reduction of parameters, which severely harms accuracy.

In this paper, we aim at solving this trade-off as a whole, without settling on only one of its sides.
We propose ERFNet (Efficient Residual Factorized Network), a ConvNet for real-time and accurate semantic segmentation. The core element of our architecture is a novel layer design that leverages skip connections and convolutions with 1D kernels. While the skip connections allow the convolutions to learn Residual functions that facilitate training, the 1D Factorized convolutions allow a significant reduction of the computational costs while retaining a similar accuracy compared to the 2D ones. The proposed block is then stacked sequentially to build our encoder-decoder architecture, which produces semantic segmentation end-to-end in the same resolution as the input (see Fig. 1 for an example). A comprehensive set of experiments on the challenging Cityscapes [13] dataset of urban scenes demonstrates the remarkable trade-off between accuracy and efficiency of our architecture, reaching an accuracy as competitive as the top networks, while also being among the fastest ones. This paper is an extension of our conference paper [14], which has been extended with a detailed description of the proposed residual block and the full architecture ERFNet, along with an extended set of experiments.

RELATED WORKS
ConvNets were initially designed for image classification challenges, which consist in predicting single object categories from images.
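As a side note on the 1D factorized convolutions that form the core of the proposed block, the cost argument can be illustrated with a simple weight count. The sketch below is not taken from the ERFNet code: the function names and the choice of 64 channels are assumptions made purely for illustration, and biases are ignored. It compares one 3x3 convolution against a 3x1 convolution followed by a 1x3 convolution with the same number of input and output channels.

```python
def conv2d_weights(channels, kernel_h, kernel_w):
    """Weights of one conv layer with `channels` inputs and outputs (bias ignored)."""
    return channels * channels * kernel_h * kernel_w


def factorized_savings(channels, k=3):
    """Compare a k x k convolution with its k x 1 + 1 x k factorization."""
    full = conv2d_weights(channels, k, k)             # one 2D convolution
    factorized = (conv2d_weights(channels, k, 1)      # vertical 1D convolution
                  + conv2d_weights(channels, 1, k))   # horizontal 1D convolution
    return full, factorized, 1 - factorized / full


# 64 channels is a hypothetical value chosen only for this example.
full, fact, saved = factorized_savings(channels=64)
print(full, fact, round(saved, 3))  # prints: 36864 24576 0.333
```

For 3x3 kernels the factorized pair needs about one third fewer weights, whatever the channel count. Note that a 1D pair can only reproduce a 2D kernel exactly when that kernel is separable, so the count says nothing about accuracy, which has to be assessed experimentally.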
Long et al. [5] (FCN) first adapted known classification networks (e.g., VGG16 [15]) to perform end-to-end semantic segmentation by making them fully convolutional and upsampling the output feature maps. However, directly adapting these networks results in coarse pixel outputs (and thus low pixel accuracy) due to the high downsampling that is performed in the classification task to gather more context. To refine these outputs, the authors propose to fuse them with activation maps from shallower layers using skip connections. Badrinarayanan et al. [16] (SegNet) proposed to upsample the features with a large decoder segment that performs finer unpooling by using the indices of the encoder's max-pooling blocks.
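The index-based unpooling idea attributed to SegNet can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not the SegNet implementation: the helper names `max_pool_with_indices` and `max_unpool` are hypothetical, pooling windows are non-overlapping, and the input height and width are assumed to be divisible by the window size. In a real decoder the values being unpooled would be decoder features rather than the pooled values themselves; only the stored positions come from the encoder.

```python
def max_pool_with_indices(x, size=2):
    """Non-overlapping size x size max pooling that also records argmax positions."""
    pooled, indices = [], []
    for r in range(0, len(x), size):
        pooled_row, index_row = [], []
        for c in range(0, len(x[0]), size):
            # Collect (value, row, col) for every cell of the current window.
            window = [(x[i][j], i, j)
                      for i in range(r, r + size)
                      for j in range(c, c + size)]
            value, i, j = max(window, key=lambda t: t[0])
            pooled_row.append(value)
            index_row.append((i, j))
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool(values, indices, height, width):
    """Place each value back at its recorded position; all other cells stay zero."""
    out = [[0] * width for _ in range(height)]
    for value_row, index_row in zip(values, indices):
        for value, (i, j) in zip(value_row, index_row):
            out[i][j] = value
    return out


feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 5],
               [0, 1, 7, 2],
               [6, 2, 3, 1]]
pooled, indices = max_pool_with_indices(feature_map)
restored = max_unpool(pooled, indices, 4, 4)
print(pooled)    # prints: [[4, 5], [6, 7]]
print(restored)  # prints: [[0, 0, 0, 0], [4, 0, 0, 5], [0, 0, 7, 0], [6, 0, 0, 0]]
```

The restored map is sparse: each maximum returns to its original location, which preserves boundary positions better than uniform upsampling, and subsequent convolutions in the decoder are expected to densify it.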