Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks

Chen Zhang1 (chen.ceca@pku.edu.cn), Peng Li2 (pengli@cs.ucla.edu), Guangyu Sun1,3 (gsun@pku.edu.cn), Yijin Guan1 (guanyijin@pku.edu.cn), Bingjun Xiao2 (xiao@cs.ucla.edu), Jason Cong2,3,1 (cong@cs.ucla.edu)

1 Center for Energy-Efficient Computing and Applications, Peking University, China
2 Computer Science Department, University of California, Los Angeles, USA
3 PKU/UCLA Joint Research Institute in Science and Engineering


ABSTRACT

Convolutional neural network (CNN) has been widely employed for image recognition because it can achieve high accuracy by emulating the behavior of optic nerves in living creatures. Recently, the rapid growth of modern applications based on deep learning algorithms has further spurred research and implementations. In particular, various accelerators for deep CNN have been proposed on the FPGA platform, because it has the advantages of high performance, reconfigurability, a fast development cycle, etc. Although current FPGA accelerators have demonstrated better performance than generic processors, the accelerator design space has not been well exploited.

One critical problem is that the computation throughput may not match the memory bandwidth provided by an FPGA platform. Consequently, existing approaches cannot achieve the best performance due to under-utilization of either logic resources or memory bandwidth. At the same time, the increasing complexity and scalability of deep learning applications aggravate this problem. To overcome it, we propose an analytical design scheme using the roofline model. For any solution of a CNN design, we quantitatively analyze its computing throughput and required memory bandwidth using various optimization techniques, such as loop tiling and transformation. Then, with the help of the roofline model, we can identify the solution with the best performance and the lowest FPGA resource requirement. As a case study, we implement a CNN accelerator on a VC707 FPGA board and compare it to previous approaches.

Our implementation achieves a peak performance of GFLOPS under a 100 MHz working frequency, which outperforms previous approaches significantly.

In addition to being a faculty member at UCLA, Jason Cong is also a co-director of the PKU/UCLA Joint Research Institute and a visiting chair professor of Peking University.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. FPGA '15, February 22-24, 2015, Monterey, California, USA. Copyright is held by the owner/author(s).

Publication rights licensed to ACM. ISBN 978-1-4503-3315-3/15/02.

Categories and Subject Descriptors: [SPECIAL-PURPOSE AND APPLICATION-BASED SYSTEMS]: Microprocessor/microcomputer applications

Keywords: FPGA; Roofline Model; Convolutional Neural Network; Acceleration

1. INTRODUCTION

Convolutional neural network (CNN), a well-known deep learning architecture extended from the artificial neural network, has been extensively adopted in various applications, including video surveillance, mobile robot vision, image search engines in data centers, etc. [6] [7] [8] [10] [14]. Inspired by the behavior of optic nerves in living creatures, a CNN design processes data with multiple layers of neuron connections to achieve high accuracy in image recognition. Recently, the rapid growth of modern applications based on deep learning algorithms has further spurred research on deep convolutional neural networks. Due to the specific computation pattern of CNN, general-purpose processors are not efficient for CNN implementation and can hardly meet the performance requirement.

Thus, various accelerators based on FPGAs, GPUs, and even ASICs have been proposed recently to improve the performance of CNN designs [3] [4] [9]. Among these approaches, FPGA-based accelerators have attracted more and more attention from researchers because they have the advantages of good performance, high energy efficiency, a fast development cycle, and the capability of reconfiguration [1] [2] [3] [6] [12] [14]. For any CNN algorithm implementation, there are many potential solutions, which results in a huge design space for exploration. In our experiments, we find that there can be as much as a 90% performance difference between two different solutions with the same FPGA logic resource utilization. It is not trivial to find the optimal solution, especially when limitations on the computation resources and memory bandwidth of an FPGA platform are considered. In fact, if an accelerator structure is not carefully designed, its computing throughput cannot match the memory bandwidth provided by an FPGA platform.

It means that the performance is degraded due to under-utilization of either logic resources or memory bandwidth. Unfortunately, advances in both FPGA technology and deep learning algorithms aggravate this problem at the same time. On one hand, the increasing logic resources and memory bandwidth provided by state-of-the-art FPGA platforms enlarge the design space. In addition, when various FPGA optimization techniques, such as loop tiling and transformation, are applied, the design space is further expanded. On the other hand, the scale and complexity of deep learning algorithms keep increasing to meet the requirements of modern applications. Consequently, it is more difficult to find the optimal solution in the design space. Thus, an efficient method is urgently required for the exploration of the FPGA-based CNN design space. To efficiently explore the design space, we propose an analytical design scheme in this work.

Our work outperforms previous approaches for two reasons. First, previous work [1] [2] [3] [6] [14] mainly focused on the computation engine and either ignored external memory operations or connected the accelerator directly to external memory. Our work, however, takes buffer management and bandwidth optimization into consideration to make better utilization of FPGA resources and achieve higher performance. Second, a previous study [12] accelerates CNN applications by reducing external data access with delicate data reuse. However, this method does not necessarily lead to the best overall performance. Moreover, their method needs to reconfigure the FPGA for different layers of computation, which is not feasible in some scenarios. Our accelerator is able to execute acceleration jobs across different layers without reprogramming the FPGA. The main contributions of this work are summarized as follows:

- We quantitatively analyze the computing throughput and required memory bandwidth of any potential solution of a CNN design on an FPGA platform.

- Under the constraints of computation resources and memory bandwidth, we identify all possible solutions in the design space using a roofline model. In addition, we discuss how to find the optimal solution for each layer in the design space.

- We propose a CNN accelerator design with uniform loop unroll factors across different convolutional layers.

- As a case study, we implement a CNN accelerator that achieves a performance of GFLOPS. To the best of our knowledge, this implementation has the highest performance and the highest performance density among existing implementations.

The rest of this paper is organized as follows: Section 2 provides background on CNN and the roofline model. Section 3 presents our analytical approach for optimizing accelerator design. Section 4 describes implementation details. Section 5 shows our experimental results. Section 6 compares our implementation with existing work, and Section 7 concludes the paper.

2. BACKGROUND

2.1 CNN Basics

Convolutional neural network (CNN) was first inspired by research in neuroscience.

After over twenty years of evolution, CNN has been gaining more and more distinction in research fields such as computer vision and AI ([11] [9]). As a classical supervised learning algorithm, CNN employs a feedforward process for recognition and a backward path for training. In industrial practice, many application designers train the CNN off-line and use the off-line-trained CNN to perform time-sensitive jobs, so the speed of the feedforward computation is what matters. In this work, we focus on speeding up the feedforward computation with an FPGA-based accelerator design. A typical CNN is composed of two components: a feature extractor and a classifier. The feature extractor is used to filter input images into feature maps that represent various features of the image. These features may include corners, lines, circular arches, etc., which are relatively invariant to position shifting or distortions.

The output of the feature extractor is a low-dimensional vector containing these features. This vector is then fed into the classifier, which is usually based on traditional artificial neural networks. The purpose of the classifier is to decide the likelihood of the categories that the input (e.g., an image) might belong to.

A typical CNN is composed of multiple computation layers. For example, the feature extractor may consist of several convolutional layers and optional sub-sampling layers. Figure 1 illustrates the computation of a convolutional layer. A convolutional layer receives N feature maps as input. Each input feature map is convolved by a shifting window with a K x K kernel to generate one pixel in one output feature map. The stride of the shifting window is S, which is normally smaller than K. A total of M output feature maps will form the set of input feature maps for the next convolutional layer. The pseudo code of a convolutional layer can be written as in Code 1.

Figure 1: Graph of a convolutional layer

for (row = 0; row < R; row++) {
  for (col = 0; col < C; col++) {
    for (to = 0; to < M; to++) {
      for (ti = 0; ti < N; ti++) {
        for (i = 0; i < K; i++) {
          for (j = 0; j < K; j++) {
L:          output_fm[to][row][col] +=
              weights[to][ti][i][j] * input_fm[ti][S*row+i][S*col+j];
}}}}}}

Code 1: Pseudo code of a convolutional layer

From the feedforward computation perspective, a previous study [5] proved that convolution operations occupy over 90% of the computation time.