SCAFFOLD: Stochastic Controlled Averaging for Federated Learning

Sai Praneeth Karimireddy (EPFL, Lausanne; based on work performed at Google Research), Satyen Kale (Google Research, New York), Mehryar Mohri (Google Research, New York, and Courant Institute, New York), Sashank J. Reddi (Google Research, New York), Sebastian U. Stich (EPFL, Lausanne), Ananda Theertha Suresh (Google Research, New York). Correspondence to: Sai Praneeth Karimireddy.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

Federated Averaging (FEDAVG) has emerged as the algorithm of choice for federated learning due to its simplicity and low communication cost. However, in spite of recent research efforts, its performance is not fully understood. We obtain tight convergence rates for FEDAVG and prove that it suffers from client-drift when the data is heterogeneous (non-iid), resulting in unstable and slow convergence. As a solution, we propose a new algorithm (SCAFFOLD) which uses control variates (variance reduction) to correct for the client-drift in its local updates. We prove that SCAFFOLD requires significantly fewer communication rounds and is not affected by data heterogeneity or client sampling. Further, we show that (for quadratics) SCAFFOLD can take advantage of similarity in the clients' data, yielding even faster convergence. The latter is the first result to quantify the usefulness of local steps in distributed optimization.

1. Introduction

Federated learning has emerged as an important paradigm in modern large-scale machine learning.

Unlike in traditional centralized learning, where models are trained using large datasets stored in a central server (Dean et al., 2012; Iandola et al., 2016; Goyal et al., 2017), in federated learning the training data remains distributed over a large number of clients, which may be phones, network sensors, hospitals, or alternative local information sources (Konečný et al., 2016b;a; McMahan et al., 2017; Mohri et al., 2019; Kairouz et al., 2019). A centralized model (referred to as the server model) is then trained without ever transmitting client data over the network, thereby ensuring a basic level of privacy.

In this work, we investigate stochastic optimization algorithms for federated learning. The key challenges for federated optimization are: 1) dealing with unreliable and relatively slow network connections between the server and the clients, 2) only a small subset of clients being available for training at a given time, and 3) large heterogeneity (non-iid-ness) in the data present on the different clients (Konečný et al., 2016a). The most popular algorithm for this setting is FEDAVG (McMahan et al., 2017). FEDAVG tackles the communication bottleneck by performing multiple local updates on the available clients before communicating to the server. While it has shown success in certain applications, its performance on heterogeneous data is still an active area of research (Li et al., 2018; Yu et al., 2019; Li et al., 2019b; Haddadpour & Mahdavi, 2019; Khaled et al., 2020). We prove that such heterogeneity indeed has a large effect on FEDAVG: it introduces a drift in the updates of each client, resulting in slow and unstable convergence.

Further, we show that this client-drift persists even if full-batch gradients are used and all clients participate throughout the training. As a solution, we propose a new Stochastic Controlled Averaging algorithm (SCAFFOLD) which tries to correct for this client-drift. Intuitively, SCAFFOLD estimates the update direction for the server model (c) and the update direction for each client (c_i).¹ The difference (c − c_i) is then an estimate of the client-drift which is used to correct the local update. This strategy successfully overcomes heterogeneity and converges in significantly fewer rounds of communication. Alternatively, one can see heterogeneity as introducing client-variance in the updates across the different clients, and SCAFFOLD then performs client-variance reduction (Schmidt et al., 2017; Johnson & Zhang, 2013; Defazio et al., 2014). We use this viewpoint to show that SCAFFOLD is relatively unaffected by client sampling.

¹We refer to these estimates as control variates and the resulting correction technique as Stochastic Controlled Averaging.
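To make the control-variate idea concrete, the sketch below shows one client performing drift-corrected local steps in NumPy. This is a minimal illustration of the correction described above, not the paper's full algorithm: the gradient oracle grad_fi, the step size eta_l, and the particular rule used to refresh the client control variate are simplifying assumptions of this example.

```python
import numpy as np

def scaffold_local_update(x_server, c_server, c_i, grad_fi, K, eta_l):
    """Illustrative sketch of K drift-corrected local steps on one client.

    x_server : current server model x (np.ndarray)
    c_server : server control variate c (estimate of the average update direction)
    c_i      : this client's control variate c_i (estimate of its own direction)
    grad_fi  : callable y -> stochastic gradient g_i(y) of f_i at y
    """
    y_i = x_server.copy()
    for _ in range(K):
        g = grad_fi(y_i)
        # (c_server - c_i) estimates the client-drift and steers each local
        # step back toward the average update direction across clients.
        y_i = y_i - eta_l * (g + c_server - c_i)
    # One simple refresh of the client control variate, motivated by the
    # variance-reduction viewpoint: the average direction implied by the
    # local progress made this round (an assumption of this sketch, not
    # necessarily the paper's exact rule).
    c_i_new = c_i - c_server + (x_server - y_i) / (K * eta_l)
    return y_i - x_server, c_i_new  # client update and new control variate
```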

Finally, while accommodating heterogeneity is important, it is equally important that a method can take advantage of similarities in the client data. We prove that SCAFFOLD indeed has such a property, requiring fewer rounds of communication when the clients are more similar.

We summarize our main results below.

- We derive tighter convergence rates for FEDAVG than previously known for convex and non-convex functions with client sampling and heterogeneous data.
- We give matching lower bounds to prove that even with no client sampling and full-batch gradients, FEDAVG can be slower than SGD due to client-drift.
- We propose a new Stochastic Controlled Averaging algorithm (SCAFFOLD) which corrects for this client-drift. We prove that SCAFFOLD is at least as fast as SGD and converges for arbitrarily heterogeneous data.
- We show that SCAFFOLD can additionally take advantage of similarity between the clients to further reduce the communication required, proving the advantage of taking local steps over large-batch SGD for the first time.

- We prove that SCAFFOLD is relatively unaffected by client sampling, obtaining variance-reduced rates, making it especially suitable for federated learning.

Finally, we confirm our theoretical results on simulated and real datasets (extended MNIST by Cohen et al. (2017)).

Related work. For identical clients, FEDAVG coincides with parallel SGD analyzed by Zinkevich et al. (2010), who proved asymptotic convergence. Stich (2018) and, more recently, Stich & Karimireddy (2019), Patel & Dieuleveut (2019), and Khaled et al. (2020) gave a sharper analysis of the same method, under the name of local SGD, also for identical functions. However, there still remains a gap between their upper bounds and the lower bound of Woodworth et al. (2018). The analysis of FEDAVG for heterogeneous clients is more delicate due to the aforementioned client-drift, first empirically observed by Zhao et al. (2018).

Several analyses bound this drift by assuming bounded gradients (Wang et al., 2019; Yu et al., 2019), or view it as additional noise (Khaled et al., 2020), or assume that the client optima are ε-close (Li et al., 2018; Haddadpour & Mahdavi, 2019). In a concurrent work, Liang et al. (2019) propose to use variance reduction to deal with client heterogeneity, but still show rates slower than SGD and do not support client sampling. Our method SCAFFOLD can also be seen as an improved version of the distributed optimization algorithm DANE of Shamir et al. (2014), where a fixed number of (stochastic) gradient steps are used in place of a proximal point update. A more in-depth discussion of related work is given in Appendix A. We summarize the complexities of different methods for heterogeneous clients in Table 1.

2. Setup

We formalize the problem as minimizing a sum of stochastic functions, with only access to stochastic samples:

$\min_{x \in \mathbb{R}^d} \Big\{ f(x) := \tfrac{1}{N} \sum_{i=1}^{N} \big( f_i(x) := \mathbb{E}_{\zeta_i}[ f_i(x; \zeta_i) ] \big) \Big\}.$

The function f_i represents the loss function on client i. All our results can be easily extended to the weighted case.

Table 2. Notation used in the paper:
N, S, and i: total number, sampled number, and index of clients
R, r: number and index of communication rounds
K, k: number and index of local update steps
x^r: aggregated server model after round r
y^r_{i,k}: i-th client's model in round r and step k
c^r, c^r_i: control variate of the server and of the i-th client after round r

We assume that f is bounded from below by f* and that each f_i is β-smooth. Further, we assume that g_i(x) := ∇f_i(x; ζ_i) is an unbiased stochastic gradient of f_i with variance bounded by σ². For some results, we assume μ ≥ 0 (strong) convexity. Note that σ only bounds the variance within each client. We also define two non-standard measures of client dissimilarity below.

(A1) (G, B)-BGD, or bounded gradient dissimilarity: there exist constants G ≥ 0 and B ≥ 1 such that
$\tfrac{1}{N} \sum_{i=1}^{N} \|\nabla f_i(x)\|^2 \le G^2 + B^2 \|\nabla f(x)\|^2, \quad \forall x.$
If the {f_i} are convex, we can relax the assumption to
$\tfrac{1}{N} \sum_{i=1}^{N} \|\nabla f_i(x)\|^2 \le G^2 + 2\beta B^2 \big(f(x) - f^\star\big), \quad \forall x.$

(A2) δ-BHD, or bounded Hessian dissimilarity:
$\|\nabla^2 f_i(x) - \nabla^2 f(x)\| \le \delta, \quad \forall x,$
and further each f_i is δ-weakly convex, i.e. $\nabla^2 f_i(x) \succeq -\delta I$.

Note that assumptions A1 and A2 are orthogonal: it is possible to have G = 0 and δ = 2β, or δ = 0 but G ≫ 0.
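As a concrete illustration (our own example, not from the paper), consider quadratic clients f_i(x) = ½‖x − a_i‖² with client optima a_i and mean ā = (1/N)Σ_i a_i. Then ∇f_i(x) = x − a_i and ∇f(x) = x − ā, and a bias-variance decomposition gives:

```latex
\frac{1}{N}\sum_{i=1}^{N}\|\nabla f_i(x)\|^2
  = \frac{1}{N}\sum_{i=1}^{N}\|x - a_i\|^2
  = \|x - \bar a\|^2 + \frac{1}{N}\sum_{i=1}^{N}\|a_i - \bar a\|^2
  = \|\nabla f(x)\|^2 + \frac{1}{N}\sum_{i=1}^{N}\|a_i - \bar a\|^2 .
```

So (A1) holds with B = 1 and G² equal to the spread of the client optima, while ∇²f_i(x) = I = ∇²f(x) for every client, so (A2) holds with δ = 0: exactly the "δ = 0 but G ≫ 0" regime mentioned above.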

3. Convergence of FedAvg

In this section, we review FEDAVG and improve its convergence analysis by deriving tighter rates than previously known. The scheme consists of two main parts: local updates to the model (1), and aggregating the client updates to update the server model (2). In each round, a subset of clients S ⊆ [N] is sampled uniformly. Each of these clients i ∈ S copies the current server model y_i = x and performs K local updates of the form:

$y_i \leftarrow y_i - \eta_l\, g_i(y_i). \qquad (1)$

Here η_l is the local step-size. Then the client updates (y_i − x) are aggregated to form the new server model using a global step-size η_g as:

$x \leftarrow x + \frac{\eta_g}{|S|} \sum_{i \in S} (y_i - x). \qquad (2)$
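The following sketch implements one communication round of the scheme in (1) and (2) in NumPy. It is a minimal illustration, not a full federated implementation: the dictionary of per-client gradient oracles and the externally supplied list of sampled indices are simplifying assumptions of this example.

```python
import numpy as np

def fedavg_round(x, client_grads, sampled, K, eta_l, eta_g):
    """One FEDAVG communication round, following equations (1) and (2).

    x            : current server model (np.ndarray)
    client_grads : dict i -> callable y -> stochastic gradient g_i(y) of f_i
    sampled      : iterable of sampled client indices S
    """
    deltas = []
    for i in sampled:
        y_i = x.copy()                      # client starts from the server model
        for _ in range(K):                  # K local updates, equation (1)
            y_i = y_i - eta_l * client_grads[i](y_i)
        deltas.append(y_i - x)              # client update (y_i - x)
    # aggregate the client updates with the global step-size, equation (2)
    return x + eta_g * np.mean(deltas, axis=0)
```

With η_g = 1 this reduces to plainly averaging the client models, the form in which FEDAVG is usually described.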

Table 1. Number of communication rounds required to reach ε accuracy for strongly convex and non-convex functions (log factors are ignored). Set μ = ε for general convex rates. (G, B) bounds the gradient dissimilarity (A1), and δ bounds the Hessian dissimilarity (A2). Our rates for FEDAVG are more general and tighter than others, even matching the lower bound. However, SGD is still faster (B ≥ 1). SCAFFOLD does not require any assumptions, is faster than SGD, and is robust to client sampling. Further, when clients become more similar (small δ), SCAFFOLD converges even faster.

Method | Strongly convex | Non-convex | Sampling | Assumptions
SGD (large batch) | σ²/(μNKε) + 1/μ | σ²/(NKε²) + 1/ε | |
FedAvg (Li et al., 2019b) | σ²/(μ²NKε) + G²K/(μ²ε) | | | (G, 0)-BGD
FedAvg (Yu et al., 2019) | | σ²/(NKε²) + G²NK/ε | | (G, 0)-BGD
FedAvg (Khaled et al., 2020) | (σ² + G²)/(μNKε) + (σ + G)/(μ√ε) + NB²/μ | | | (G, B)-BGD
FedAvg, Ours (Thm. I)¹ | M²/(μSKε) + G/(μ√ε) + B²/μ | M²/(SKε²) + G/ε^{3/2} + B²/ε | ✓ | (G, B)-BGD
Lower bound (Thm. II) | Ω(σ²/(μNKε) + G/(μ√ε)) | | | (G, 1)-BGD
FedProx (Li et al., 2018)² | B²/μ | B²/ε (weakly convex) | ✓ | σ = 0, (0, B)-BGD
DANE (Shamir et al., 2014)²,³ | δ²/μ² | | | σ = 0, δ-BHD
VRL-SGD (Liang et al., 2019) | | Nσ²/(Kε²) + N/ε | |
SCAFFOLD (Thm. III) | σ²/(μSKε) + 1/μ + N/S | σ²/(SKε²) + (1/ε)(N/S)^{2/3} | ✓ |
SCAFFOLD (Thm. IV)³ | σ²/(μNKε) + 1/(μK) + δ/μ | σ²/(NKε²) + 1/(Kε) + δ/ε | | δ-BHD

¹ M² := σ² + K(1 − S/N)G². Note that M²/S = σ²/N when there is no sampling (S = N).
² Proximal point method. ³ Only for quadratics.

3.1. Rate of convergence

We now state our novel convergence results for functions with bounded dissimilarity (proofs in the Appendix).

Theorem I. For β-smooth functions {f_i} which satisfy (A1), the output of FEDAVG has expected error smaller than ε in each of the below three cases, for some values of η_l and η_g, with the following bound on the number of communication rounds R:

- Strongly convex: $R = \tilde{\mathcal{O}}\Big( \frac{\sigma^2}{\mu K S \epsilon} + \big(1 - \tfrac{S}{N}\big)\frac{G^2}{\mu S \epsilon} + \frac{G}{\mu\sqrt{\epsilon}} + \frac{B^2}{\mu} \Big),$
- General convex: $R = \mathcal{O}\Big( \frac{\sigma^2 D^2}{K S \epsilon^2} + \big(1 - \tfrac{S}{N}\big)\frac{G^2 D^2}{S \epsilon^2} + \frac{G}{\epsilon^{3/2}} + \frac{B^2 D^2}{\epsilon} \Big),$
- Non-convex: $R = \mathcal{O}\Big( \frac{\sigma^2 F}{K S \epsilon^2} + \big(1 - \tfrac{S}{N}\big)\frac{G^2 F}{S \epsilon^2} + \frac{G}{\epsilon^{3/2}} + \frac{B^2 F}{\epsilon} \Big),$

where D := ‖x^0 − x*‖ and F := f(x^0) − f*.
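As a quick sanity check (our own specialization of the strongly convex bound exactly as stated above, not an additional result from the paper), setting G = 0 and B = 1 (homogeneous clients) with full participation S = N removes the heterogeneity terms:

```latex
% Strongly convex case of Theorem I with G = 0, B = 1 and S = N:
R \;=\; \tilde{\mathcal{O}}\!\left( \frac{\sigma^2}{\mu N K \epsilon} \;+\; \frac{1}{\mu} \right),
```

which matches the large-batch SGD row of Table 1. With G > 0, FEDAVG pays the additional G/(μ√ε) term, which does not shrink as the number of local steps K grows; this is the client-drift penalty that can make FEDAVG slower than SGD on heterogeneous data.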

