
Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks

This paper is included in the Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation, November 4-6, 2020. ISBN 978-1-939133-19-9. Open access to the Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation is sponsored by USENIX.

Lingxiao Ma, Peking University and Microsoft Research; Zhiqiang Xie, ShanghaiTech University and Microsoft Research; Zhi Yang, Peking University; Jilong Xue, Youshan Miao, Wei Cui, Wenxiang Hu, Fan Yang, Lintao Zhang, and Lidong Zhou, Microsoft Research. (The first two authors contributed equally.)

Abstract

Performing Deep Neural Network (DNN) computation on hardware accelerators efficiently is challenging.

Existing DNN frameworks and compilers often treat the DNN operators in a data flow graph (DFG) as opaque library functions and schedule them onto accelerators to be executed individually. They rely on another layer of scheduler, often implemented in hardware, to exploit the parallelism available in the operators. Such a two-layered approach incurs significant scheduling overhead and often cannot fully utilize the available hardware resources. In this paper, we propose RAMMER, a DNN compiler design that optimizes the execution of DNN workloads on massively parallel accelerators. RAMMER generates an efficient static spatio-temporal schedule for a DNN at compile time to minimize scheduling overhead. It maximizes hardware utilization by holistically exploiting parallelism through inter- and intra-operator co-scheduling. RAMMER achieves this by proposing several novel, hardware-neutral, and clean abstractions for the computation tasks and the hardware accelerators.

These abstractions expose a much richer scheduling space to RAMMER, which employs several heuristics to explore this space and find efficient schedules. We implement RAMMER for multiple hardware backends such as NVIDIA GPUs, AMD GPUs, and Graphcore IPUs. Experiments show that RAMMER significantly outperforms state-of-the-art compilers such as TensorFlow XLA and TVM. It also outperforms TensorRT, a vendor-optimized proprietary DNN inference library from NVIDIA.

1 Introduction

Deep neural network (DNN) is now a widely adopted approach for image classification, natural language processing, and many other AI tasks. Due to its importance, many computational devices, such as CPUs, GPUs, FPGAs, and specially designed DNN accelerators, have been leveraged to perform DNN computation. Efficient DNN computation on these devices is an important topic that has attracted much research attention in recent years [23, 28, 32, 40, 52].

One of the key factors that affect the efficiency of DNN computation is scheduling: deciding the order in which to perform the various pieces of computation on the target hardware. The importance of scheduling in general is well known and has been thoroughly studied [20, 39]. However, there is little work discussing scheduling for DNN computation on hardware devices.

The computational pattern of a deep neural network is usually modeled as a data flow graph (DFG), where each node corresponds to an operator, which represents a unit of computation such as matrix multiplication, while an edge depicts the dependency between operators. This representation naturally contains two levels of parallelism. The first level is inter-operator parallelism, where operators that do not have dependencies in the DFG may run in parallel. The second level is intra-operator parallelism, where an operator such as matrix multiplication has inherent internal data parallelism and can leverage hardware accelerators that can perform parallel computation, such as a GPU.
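To make these two levels concrete, the following is a minimal sketch of such a DFG; the struct and field names are illustrative assumptions, not taken from any framework. Operators in the same topological level have no dependency on each other and can run in parallel, while each operator also exposes its own internal tiles.

#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

struct Operator {
    std::string name;
    std::vector<int> deps;   // indices of producer operators (DFG edges)
    int parallel_tiles;      // intra-operator parallelism exposed by the kernel
};

int main() {
    // conv1 and conv2 both depend only on the input and not on each other,
    // so they expose inter-operator parallelism; each also exposes many
    // independent tiles, i.e. intra-operator parallelism.
    std::vector<Operator> dfg = {
        {"input",  {},     1},
        {"conv1",  {0},   64},
        {"conv2",  {0},   64},
        {"concat", {1, 2}, 8},
    };

    // Operators are listed in topological order, so a node's level is one more
    // than the deepest of its producers; nodes sharing a level have no
    // dependency between them and may run concurrently.
    std::vector<int> level(dfg.size(), 0);
    for (size_t i = 0; i < dfg.size(); ++i)
        for (int d : dfg[i].deps)
            level[i] = std::max(level[i], level[d] + 1);

    for (size_t i = 0; i < dfg.size(); ++i)
        std::printf("%-6s level=%d intra-op tiles=%d\n",
                    dfg[i].name.c_str(), level[i], dfg[i].parallel_tiles);
}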

To exploit the two levels of parallelism, current practice adopts a two-layered scheduling approach. An inter-operator, DFG-layer scheduler takes the data flow graph and emits operators that are ready to be executed based on the dependencies. In addition, an intra-operator scheduler takes an operator and maps it to the parallel execution units in the accelerator. This layering design has a fundamental impact on the system architectures of the existing DNN tool sets. For example, the DFG-layer scheduler is typically implemented in deep learning frameworks such as TensorFlow [18] or ONNX Runtime [14]. The operator-layer scheduler, on the other hand, is often hidden behind operator libraries such as cuDNN [12] and MKL-DNN [9], and sometimes implemented directly in hardware, as is the case for GPUs.

Although widely adopted by existing frameworks and accelerators, such a two-layer scheduling approach incurs fundamental performance limitations. The approach works well only when the overhead of emitting operators is largely negligible compared to the execution time of operators, and when there is sufficient intra-operator parallelism to saturate all processing units in an accelerator.
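The sketch below caricatures this two-layered scheme under simplifying assumptions: a ready-queue DFG scheduler and a stand-in launch_on_accelerator function, neither of which is taken from a real framework. The DFG layer emits operators one at a time as their dependencies resolve, and each emission pays a per-operator launch whose internal tile-to-execution-unit mapping is left to the lower layer.

#include <cstdio>
#include <queue>
#include <vector>

struct Op {
    std::vector<int> deps;  // producer operators
    int tiles;              // intra-operator parallelism
};

// Stand-in for the lower layer: on a real system this would be an opaque
// library kernel launch (e.g. via cuDNN), and the accelerator's hardware
// scheduler would map the tiles onto its execution units.
void launch_on_accelerator(int op_id, int tiles) {
    std::printf("emit op %d (%d parallel tiles) -> accelerator\n", op_id, tiles);
}

int main() {
    std::vector<Op> dfg = {{{}, 1}, {{0}, 64}, {{0}, 64}, {{1, 2}, 8}};
    std::vector<int> pending(dfg.size());
    std::queue<int> ready;
    for (size_t i = 0; i < dfg.size(); ++i) {
        pending[i] = static_cast<int>(dfg[i].deps.size());
        if (pending[i] == 0) ready.push(static_cast<int>(i));
    }

    // Upper (DFG) layer: operators are emitted one at a time as their
    // dependencies resolve; each emission pays a per-operator launch overhead,
    // and even independent operators are issued sequentially from here.
    while (!ready.empty()) {
        int op = ready.front();
        ready.pop();
        launch_on_accelerator(op, dfg[op].tiles);
        for (size_t j = 0; j < dfg.size(); ++j)
            for (int d : dfg[j].deps)
                if (d == op && --pending[j] == 0)
                    ready.push(static_cast<int>(j));
    }
}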

This unfortunately is often not the case in practice. DNN accelerators keep on increasing performance at a much faster pace than CPUs, thus making the operator-emitting overhead more and more significant. The problem is exacerbated for DNN inference workloads when the batch size is small, which limits the intra-operator parallelism. Moreover, the two-layer scheduling approach overlooks the subtle interplay between the upper and lower layers: to optimize the overall performance, a system could reduce the degree of intra-operator parallelism in order to increase the level of inter-operator parallelism (§2).

To mitigate these limitations, we present RAMMER, a deep learning compiler that takes a holistic approach to manage the parallelism available in the DNN computation. RAMMER unifies the inter- and intra-operator scheduling through a novel abstraction that enables the scheduler to break the operator boundary and allows fine-grained scheduling of computation onto devices.

Instead of the existing design that breaks scheduling into two pieces managed by software and hardware separately, RAMMER is a unified software-only solution, which makes it less dependent on the underlying hardware and thus adoptable by diverse DNN accelerators. In RAMMER, we make the following design decisions.

First, to exploit the intra-operator parallelism through a software compiler, RAMMER redefines a DNN operator as an rTask-operator, or rOperator. An rOperator consists of multiple independent, homogeneous rTasks, each of which is a minimum schedulable unit that runs on a single execution unit of an accelerator (e.g., a streaming multiprocessor, or SM, in a GPU). The fine-grained intra-operator parallelism is thus exposed as rTasks. RAMMER still models a DNN as a data flow graph, now composed of rOperator nodes, hence it can still see the coarse-grained inter-operator (DFG) parallelism.
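The following is a minimal sketch of the rOperator/rTask abstraction as described above; the C++ class and method names are assumptions made for illustration rather than RAMMER's actual interface.

#include <cstdio>

// An rTask is the minimum schedulable unit: one homogeneous piece of an
// operator that runs on a single execution unit (e.g. one SM of a GPU).
struct RTask {
    int op_id;    // which rOperator this rTask belongs to
    int task_id;  // index within the operator
};

// An rOperator exposes its intra-operator parallelism as independent rTasks
// instead of hiding it behind an opaque kernel.
class ROperator {
public:
    ROperator(int op_id, int num_rtasks) : op_id_(op_id), num_rtasks_(num_rtasks) {}
    int NumRTasks() const { return num_rtasks_; }
    RTask GetRTask(int i) const { return RTask{op_id_, i}; }
private:
    int op_id_;
    int num_rtasks_;
};

int main() {
    // A matrix multiplication tiled into 8 independent, homogeneous rTasks;
    // a scheduler is now free to interleave them with rTasks of other operators.
    ROperator matmul(/*op_id=*/0, /*num_rtasks=*/8);
    for (int i = 0; i < matmul.NumRTasks(); ++i) {
        RTask t = matmul.GetRTask(i);
        std::printf("rTask %d of rOperator %d\n", t.task_id, t.op_id);
    }
}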

Second, certain modern accelerators such as GPUs do not expose interfaces for intra-operator (i.e., rTask) scheduling. To address this challenge, as a second design decision, RAMMER abstracts a hardware accelerator as a virtualized parallel device (vDevice), which contains multiple virtualized execution units (vEUs). The vDevice allows several rTasks, even from different operators, to run on a specified vEU in a desired order. Moreover, a vEU can run a barrier rTask that waits for the completion of a specified set of rTasks, thus ensuring the correct execution of rTasks from dependent operators. The vDevice maps a vEU to one of the physical execution units in an accelerator to perform the actual computation.
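Below is a hedged sketch of what a static spatio-temporal schedule over vEUs might look like, following the description above; the ScheduleEntry, VEU, and VDevice types and the barrier encoding are illustrative assumptions, not RAMMER's actual data structures. Each vEU holds a fixed, compile-time-ordered list of rTasks and barrier rTasks, and is mapped to one physical execution unit at run time.

#include <cstdio>
#include <vector>

struct RTask {
    int op_id;
    int task_id;
};

// One entry in a vEU's static schedule: either a compute rTask or a barrier
// rTask that waits for all rTasks of the listed operators to complete.
struct ScheduleEntry {
    bool is_barrier;
    RTask task;                     // valid when !is_barrier
    std::vector<int> wait_for_ops;  // valid when is_barrier
};

// A virtualized execution unit: an ordered list of entries fixed at compile
// time; at run time each vEU is mapped to one physical execution unit
// (e.g. an SM) and simply executes its list in order.
struct VEU {
    std::vector<ScheduleEntry> entries;
};

struct VDevice {
    std::vector<VEU> veus;
};

int main() {
    VDevice dev;
    dev.veus.resize(2);
    // rTasks of two independent operators (op 0 and op 1) are interleaved on
    // each vEU; a barrier then guards the rTasks of a dependent operator (op 2).
    for (int u = 0; u < 2; ++u) {
        VEU& veu = dev.veus[u];
        veu.entries.push_back({false, {0, u}, {}});
        veu.entries.push_back({false, {1, u}, {}});
        veu.entries.push_back({true, {}, {0, 1}});  // barrier rTask
        veu.entries.push_back({false, {2, u}, {}});
    }

    for (size_t u = 0; u < dev.veus.size(); ++u) {
        for (const auto& e : dev.veus[u].entries) {
            if (e.is_barrier)
                std::printf("vEU %zu: barrier, wait for ops 0 and 1\n", u);
            else
                std::printf("vEU %zu: run rTask %d of op %d\n",
                            u, e.task.task_id, e.task.op_id);
        }
    }
}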

Third, fine-grained scheduling could incur significant runtime overheads, even more so than the operator scheduling overhead discussed previously. To address this issue, RAMMER moves the scheduling decisions from runtime to compile time. This is driven by the observation that the DFG of most DNNs is available at compile time, and the operators usually exhibit deterministic performance characteristics. Therefore, the runtime performance can be obtained through compile-time profiling [45]. This not only avoids unnecessary runtime overheads, but also allows a more costly scheduling policy to fully exploit the inter- and intra-operator parallelism.
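As a toy illustration of profile-guided, compile-time placement, the sketch below assigns rTasks to vEUs with a simple greedy earliest-available policy using profiled per-rTask costs; the greedy policy and all names here are assumptions made for illustration, and the actual scheduling policy in the paper is more elaborate.

#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    // Per-rTask costs (e.g. microseconds) obtained by offline profiling;
    // the observation above is that these are largely deterministic.
    std::vector<double> rtask_cost = {5.0, 5.0, 3.0, 3.0, 3.0, 8.0};
    const int num_veus = 2;
    std::vector<double> busy_until(num_veus, 0.0);

    // Greedy placement at compile time: put each rTask on the vEU that
    // becomes free first. No decisions are left to run time.
    for (size_t t = 0; t < rtask_cost.size(); ++t) {
        int best = static_cast<int>(
            std::min_element(busy_until.begin(), busy_until.end()) -
            busy_until.begin());
        std::printf("rTask %zu -> vEU %d, start at t=%.1f\n",
                    t, best, busy_until[best]);
        busy_until[best] += rtask_cost[t];
    }
    std::printf("estimated makespan: %.1f\n",
                *std::max_element(busy_until.begin(), busy_until.end()));
}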

RAMMER is compatible with optimizations developed in existing DNN compilers. RAMMER can import a data flow graph from other frameworks such as TensorFlow. Such a DFG can be optimized with techniques employed by a traditional graph optimizer such as [18]. An rOperator can also be optimized by an existing kernel tuner [23]. Our experience shows that, on top of existing optimizations, RAMMER can provide significant additional performance improvement, especially for DNN inference workloads.

RAMMER is hardware neutral. The proposed abstractions, such as rTask, rOperator, and vEU, are applicable to any massively parallel computational device with homogeneous execution units. This includes almost all the computational devices proposed for DNN workloads. In this paper, in addition to describing in detail how RAMMER is implemented on NVIDIA GPUs, we also discuss our experience retargeting RAMMER to several alternative computing devices.

We have implemented RAMMER with 52k lines of C++ code and open-sourced the code. Our evaluation on 6 DNN models shows that RAMMER significantly outperforms state-of-the-art compilers like XLA and TVM on both NVIDIA and AMD GPUs, and also outperforms TensorRT [13], a vendor-optimized DNN inference library from NVIDIA.

Our experience with RAMMER strongly suggests that the current industry-prevalent practice of vendors supplying highly optimized DNN operator implementations in library form (such as cuDNN and MKL-DNN) is sub-optimal. This practice incurs a significant efficiency cost for DNN computation. The situation will become even worse in the coming years as modern accelerators keep increasing the available hardware parallelism while new DNN architectures strive to save computation by replacing larger operators with many smaller ones [49, 54].

