
A Cloud-Scale Acceleration Architecture

Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, Daniel Lo, Todd Massengill, Kalin Ovtcharov, Michael Papamichael, Lisa Woods, Sitaram Lanka, Derek Chiou, Doug Burger
Microsoft Corporation

Abstract — Hyperscale datacenter providers have struggled to balance the growing need for specialized hardware (efficiency) with the economic benefits of homogeneity (manageability). In this paper we propose a new cloud architecture that uses reconfigurable logic to accelerate both network plane functions and applications. This Configurable Cloud architecture places a layer of reconfigurable logic (FPGAs) between the network switches and the servers, enabling network flows to be programmably transformed at line rate, enabling acceleration of local applications running on the server, and enabling the FPGAs to communicate directly, at datacenter scale, to harvest remote FPGAs unused by their local servers.

[Figure: accelerator board block diagram — Altera Stratix V D5 FPGA, 256 Mb config flash, 4 GB DDR3-1600, USB-to-JTAG µC, 40Gb QSFP port to TOR and 40Gb QSFP port to NIC, each 4 lanes @ 10.3125 Gbps.]




We deployed this design over a production server bed, and show how it can be used for both service acceleration (web search ranking) and network acceleration (encryption of data in transit at high speeds). This architecture is much more scalable than prior work, which used secondary rack-scale networks for inter-FPGA communication. By coupling to the network plane, direct FPGA-to-FPGA messages can be achieved at comparable latency to previous work, without the secondary network. Additionally, the scale of direct inter-FPGA messaging is much larger: the average round-trip latencies observed in our measurements among 24; 1,000; and 250,000 machines are under 3, 9, and 20 microseconds, respectively. The Configurable Cloud architecture has been deployed at hyperscale in Microsoft's production datacenters.

INTRODUCTION

Modern hyperscale datacenters have made huge strides with improvements in networking, virtualization, energy efficiency, and infrastructure management, but still have the same basic structure as they have for years: individual servers with multicore CPUs, DRAM, and local storage, connected by the NIC through Ethernet switches to other servers.

At hyperscale (hundreds of thousands to millions of servers), there are significant benefits to maximizing homogeneity; workloads can be migrated fungibly across the infrastructure, and management is simplified, reducing costs and configuration errors. However, the slowdown in CPU scaling and the ending of Moore's Law have resulted in a growing need for hardware specialization to increase performance and efficiency. Yet placing specialized accelerators in a subset of a hyperscale infrastructure's servers reduces the highly desirable homogeneity. The question is mostly one of economics: whether it is cost-effective to deploy an accelerator in every new server, whether it is better to specialize a subset of an infrastructure's new servers and maintain an ever-growing number of configurations, or whether it is most cost-effective to do neither.

Any specialized accelerator must be compatible with the target workloads through its deployment lifetime (e.g. six years: two years to design and deploy the accelerator and four years of server deployment lifetime). This requirement is a challenge given both the diversity of cloud workloads and the rapid rate at which they change (weekly or monthly). It is thus highly desirable that accelerators incorporated into hyperscale servers be programmable, the two most common examples being FPGAs and GPUs. Both GPUs and FPGAs have been deployed in datacenter infrastructure at reasonable scale without direct connectivity between accelerators [1], [2], [3]. Our recent publication described a medium-scale FPGA deployment in a production datacenter to accelerate Bing web search ranking using multiple directly-connected accelerators [4].

That design consisted of a rack-scale fabric of 48 FPGAs connected by a secondary network. While effective at accelerating search ranking, our first architecture had several significant limitations:

- The secondary network (a 6x8 torus) required expensive and complex cabling, and required awareness of the physical location of machines.
- Failure handling of the torus required complex re-routing of traffic to neighboring nodes, causing both performance loss and isolation of nodes under certain failure patterns.
- The number of FPGAs that could communicate directly, without going through software, was limited to a single rack (48 nodes).
- The fabric was a limited-scale "bolt on" accelerator, which could accelerate applications but offered little for enhancing the datacenter infrastructure, such as networking and storage flows.

In this paper, we describe a new cloud-scale, FPGA-based acceleration architecture, which we call the Configurable Cloud, which eliminates all of the limitations listed above with a single design.
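To make the first two limitations concrete, here is a small illustrative sketch (our own, not the fabric's actual routing logic; node coordinates are hypothetical) of neighbor wiring in a 6x8 torus and breadth-first rerouting around a failed node. A single failure turns a short path into a longer detour, and the routing logic must track every node's physical position.

```python
# Illustrative sketch of a 6x8 torus like the earlier rack-scale fabric:
# each node wires to 4 neighbors with wraparound, and a failure forces
# BFS rerouting through neighboring nodes.
from collections import deque

ROWS, COLS = 6, 8  # 48 FPGAs per rack

def neighbors(r, c):
    """The 4 torus neighbors of (r, c), wrapping at the edges."""
    return [((r - 1) % ROWS, c), ((r + 1) % ROWS, c),
            (r, (c - 1) % COLS), (r, (c + 1) % COLS)]

def route(src, dst, failed=frozenset()):
    """Shortest hop path from src to dst avoiding failed nodes (None if cut off)."""
    frontier, seen = deque([[src]]), {src}
    while frontier:
        path = frontier.popleft()
        if path[-1] == dst:
            return path
        for nxt in neighbors(*path[-1]):
            if nxt not in seen and nxt not in failed:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return None

direct = route((0, 0), (0, 2))
detour = route((0, 0), (0, 2), failed={(0, 1)})
```

With no failures the path is 2 hops; routing around the failed middle node doubles it to 4, and under worse failure patterns a node can be isolated entirely.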

This architecture has been and is being deployed in the majority of new servers in Microsoft's production datacenters across more than 15 countries and 5 continents. A Configurable Cloud allows the datapath of cloud communication to be accelerated with programmable hardware. This datapath can include networking flows, storage flows, security operations, and distributed (multi-FPGA) applications. The key difference over previous work is that the accelerator hardware is tightly coupled with the datacenter network, placing a layer of FPGAs between the servers' NICs and the Ethernet network switches.

[Fig. 1. (a) Decoupled programmable hardware plane (TOR/L1/L2 switch hierarchy over services such as deep neural networks, web search ranking, expensive compression, and bioinformatics), (b) Server + FPGA.]

Figure 1b shows how the accelerator fits into a host server. All network traffic is routed through the FPGA, allowing it to accelerate high-bandwidth network flows. An independent PCIe connection to the host CPUs is also provided, allowing the FPGA to be used as a local compute accelerator. The standard network switch and topology removes the impact of failures on neighboring servers, removes the need for non-standard cabling, and eliminates the need to track the physical location of machines in each rack. While placing FPGAs as a network-side "bump-in-the-wire" solves many of the shortcomings of the torus topology, much more is possible. By enabling the FPGAs to generate and consume their own networking packets independent of the hosts, each and every FPGA in the datacenter can reach every other one (at a scale of hundreds of thousands) in a small number of microseconds, without any intervening software.
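The bump-in-the-wire idea can be sketched as follows (a minimal illustration of the dataflow, not the FPGA implementation; the flow tags and the XOR "cipher" are hypothetical stand-ins): every frame between NIC and TOR passes through the accelerator, which either forwards it untouched or applies an inline transform such as encryption.

```python
# Minimal sketch of bump-in-the-wire processing: frames matching a rule
# are transformed inline; everything else passes through unchanged.
def bump_in_the_wire(frame, transforms):
    """Apply the first matching inline transform, else pass through."""
    for match, transform in transforms:
        if match(frame):
            return transform(frame)
    return frame

# Hypothetical rule: "encrypt" payloads on flows tagged secure.
# (XOR with a constant stands in for a real cipher such as AES.)
transforms = [
    (lambda f: f["flow"] == "secure",
     lambda f: {**f, "payload": bytes(b ^ 0x5A for b in f["payload"])}),
]

plain = {"flow": "bulk", "payload": b"hello"}
secure = {"flow": "secure", "payload": b"hello"}
passed = bump_in_the_wire(plain, transforms)
out = bump_in_the_wire(secure, transforms)["payload"]
```

Because the transform sits on the wire rather than behind the host's PCIe and networking stack, it can run at line rate without consuming CPU cycles.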

This capability allows hosts to use remote FPGAs for acceleration with low latency, improving the economics of the accelerator deployment, as hosts running services that do not use their local FPGAs can donate them to a global pool and extract value which would otherwise be stranded. Moreover, this design choice essentially turns the distributed FPGA resources into an independent computer in the datacenter, at the same scale as the servers, that physically shares the network wires with software. Figure 1a shows a logical view of this plane of computation. This model offers significant flexibility. From the local perspective, the FPGA is used as a compute or a network accelerator. From the global perspective, the FPGAs can be managed as a large-scale pool of resources, with acceleration services mapped to remote FPGA resources.

Ideally, servers not using all of their local FPGA resources can donate those resources to the global pool, while servers that need additional resources can request the available resources on remote servers. Failing nodes are removed from the pool with replacements quickly added. As demand for a service grows or shrinks, a global manager grows or shrinks the pools correspondingly. Services are thus freed from having a fixed ratio of CPU cores per FPGAs, and can instead allocate (or purchase, in the case of IaaS) only the resources of each type needed. Space limitations prevent a complete description of the management policies and mechanisms for the global resource manager. Instead, this paper focuses first on the hardware architecture necessary to treat remote FPGAs as available resources for global acceleration pools.
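Since the paper defers the management policies, the donate/acquire/evict lifecycle above can be sketched only under our own assumptions (the class, method names, and FPGA identifiers below are hypothetical, not the actual global resource manager):

```python
# Illustrative sketch of a global FPGA pool: servers donate idle boards,
# services acquire and release them, and failed boards are evicted so
# they can no longer be handed out.
class FpgaPool:
    def __init__(self):
        self.free = set()   # donated, currently unused FPGAs
        self.in_use = {}    # fpga_id -> name of the service holding it

    def donate(self, fpga_id):
        self.free.add(fpga_id)

    def acquire(self, service):
        if not self.free:
            return None     # caller falls back to a local or software path
        fpga_id = self.free.pop()
        self.in_use[fpga_id] = service
        return fpga_id

    def release(self, fpga_id):
        self.in_use.pop(fpga_id, None)
        self.free.add(fpga_id)

    def evict(self, fpga_id):
        """Remove a failed node from the pool entirely."""
        self.free.discard(fpga_id)
        self.in_use.pop(fpga_id, None)

pool = FpgaPool()
pool.donate("fpga-0")
pool.donate("fpga-1")
ranker = pool.acquire("web-search-ranking")
```

Growing or shrinking a service's pool then reduces to repeated acquire/release against demand, which is what decouples the CPU-to-FPGA ratio from the physical servers.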

We describe the communication protocols and mechanisms that allow nodes in a remote acceleration service to connect, including a protocol called LTL (Lightweight Transport Layer) that supports lightweight connections between pairs of FPGAs, with mostly lossless transport and extremely low latency (small numbers of microseconds). This protocol makes the datacenter-scale remote FPGA resources appear closer than either a single local SSD access or the time to get through the host's networking stack. Then, we describe an evaluation system of 5,760 servers which we built and deployed as a precursor to hyperscale production deployment. We measure the performance characteristics of the system, using web search and network flow encryption as examples. We show that significant gains in efficiency are possible, and that this new architecture enables a much broader and more robust architecture for acceleration.

[Fig. 2: block diagram of the accelerator board — Altera Stratix V D5 FPGA, 256 Mb config flash (QSPI), 4 GB DDR3-1600 (72 bits with ECC), USB-to-JTAG µC, 40Gb QSFP to TOR and 40Gb QSFP to NIC each 4 lanes @ 10.3125 Gbps, PCIe mezzanine connector (two PCIe Gen3 x8 links), I2C for temp, power, and LEDs.]

