
Federated Learning - University of California, Berkeley




Transcription of Federated Learning - University of California, Berkeley

1 Federated Learning. Min Du, Postdoc, UC Berkeley.

Outline: Preliminary: deep learning and SGD; Federated learning: FedSGD and FedAvg; Related research in federated learning; Open problems.

Deep learning looks for a function which produces a desired output for a particular task. Given an input, produce the desired output: image classification (an image of a handwritten digit -> "8"); playing Go (a board position -> the next move); next-word prediction ("Looking forward to your ___?" -> "reply"). $\theta$ is the set of parameters contained by the function.

The goal of deep learning: given one input sample pair $(x_i, y_i)$, the goal of deep learning model training is to find a set of parameters $\theta$ that maximizes the probability of outputting $y_i$ given $x_i$.

2 Given input $x_i$, maximize $P(y_i \mid x_i, \theta)$.

Finding the function: model training. Given a training dataset containing input-output pairs $(x_i, y_i)$, $1 \le i \le n$, the goal of deep learning model training is to find a set of parameters $\theta$ such that the average of $P(y_i \mid x_i, \theta)$ over the dataset is maximized. That is,

$\max_\theta \frac{1}{n} \sum_{i=1}^{n} P(y_i \mid x_i, \theta)$,

which is equivalent to

$\max_\theta \frac{1}{n} \sum_{i=1}^{n} \log P(y_i \mid x_i, \theta)$.

This gives a basic component for the loss function: $\ell(x_i, y_i, \theta)$, the loss on sample $(x_i, y_i)$. Let $\ell_i(\theta) = \ell(x_i, y_i, \theta)$ denote the loss function; maximizing the average log-probability is the same as minimizing the average loss when $\ell_i(\theta) = -\log P(y_i \mid x_i, \theta)$.
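To make the loss concrete, here is a minimal sketch (not from the slides) of $\ell_i(\theta) = -\log P(y_i \mid x_i, \theta)$ for a toy linear softmax classifier; the model form, data, and function names are illustrative assumptions.

```python
import numpy as np

# Minimal sketch (illustrative only): the per-sample loss l_i(theta) = -log P(y_i | x_i, theta)
# for a linear softmax classifier. The names and the model form are assumptions.

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def nll_loss(theta, X, y):
    """Average negative log-likelihood: (1/n) * sum_i -log P(y_i | x_i, theta)."""
    probs = softmax(X @ theta)                       # shape (n, num_classes)
    return -np.log(probs[np.arange(len(y)), y]).mean()

# Toy usage: 3 classes, 5 features, 8 samples.
rng = np.random.default_rng(0)
theta = rng.normal(size=(5, 3))
X, y = rng.normal(size=(8, 5)), rng.integers(0, 3, size=8)
print(nll_loss(theta, X, y))   # minimizing this maximizes the average log P(y | x, theta)
```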

3 Finding the function: model training.

Deep learning model training. For a training dataset containing samples $(x_i, y_i)$, $1 \le i \le n$, the training objective is

$\min_w f(w)$, where $f(w) = \frac{1}{n} \sum_{i=1}^{n} \ell_i(w)$ and $\ell_i(w) = \ell(x_i, y_i, w)$ is the loss of the prediction on example $(x_i, y_i)$.

There is no closed-form solution: in a typical deep learning model, $w$ may contain millions of parameters and $f(w)$ is non-convex, so multiple local minima exist.

Solution: gradient descent. Start from a randomly initialized weight $w_0$, compute the gradient $\nabla f(w_t)$, and update

$w_{t+1} = w_t - \eta \nabla f(w_t)$ (gradient descent).

At a local minimum, $\nabla f(w)$ is close to 0. The learning rate $\eta$ controls the step size. How to stop? When the update is small enough, the iterates have converged: $\|w_{t+1} - w_t\| \le \epsilon$ or $\|\nabla f(w_t)\| \le \epsilon$.

Problem: usually the number of training samples $n$ is large, so each full gradient is expensive and convergence is slow. Solution: stochastic gradient descent (SGD). At each step, instead of computing the gradient over all training samples, randomly pick a small subset (mini-batch) of training samples $(x_B, y_B)$.
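A minimal sketch of this gradient-descent loop, with the stopping rule $\|w_{t+1} - w_t\| \le \epsilon$; the least-squares toy objective and all names here are assumptions made for illustration, not part of the slides.

```python
import numpy as np

# Gradient-descent sketch for the update w_{t+1} = w_t - eta * grad f(w_t).
# The toy objective f(w) = ||Xw - y||^2 / n and its gradient are illustrative only.

def gradient_descent(grad_f, w0, eta=0.1, eps=1e-6, max_iters=10_000):
    w = w0
    for _ in range(max_iters):
        w_next = w - eta * grad_f(w)           # gradient-descent step
        if np.linalg.norm(w_next - w) <= eps:  # stop when the update is small enough
            return w_next
        w = w_next
    return w

# Toy usage: a least-squares objective as a stand-in for a deep network's loss.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
grad_f = lambda w: 2.0 / len(y) * X.T @ (X @ w - y)
w_star = gradient_descent(grad_f, w0=np.zeros(5))
```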

4 Compared to gradient descent, SGD takes more steps to converge, but each step is much faster:

$w_{t+1} = w_t - \eta \nabla \ell(w_t; x_B, y_B)$.

Outline: Preliminary: deep learning and SGD; Federated learning: FedSGD and FedAvg; Related research in federated learning; Open problems.

"The biggest obstacle to using advanced data analysis isn't skill base or technology; it's plain old access to the data." -- Edd Wilder-James, Harvard Business Review.

The importance of data for ML: "Data is the new oil." Companies such as Google and Apple train ML models on private data: all the photos a user takes and everything they type on their mobile keyboard, including passwords, URLs, messages, etc. Image models predict which photos are most likely to be viewed multiple times in the future; language models power voice recognition, next-word prediction, and auto-reply in Gmail.
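Continuing the toy example, the mini-batch SGD update at the top of this page might look as follows; the batch size and the least-squares objective are again illustrative assumptions.

```python
import numpy as np

# Mini-batch SGD sketch for w_{t+1} = w_t - eta * grad l(w_t; x_B, y_B).
# Batch size and objective are illustrative; each step uses only a small random subset.

def sgd(grad_batch, w0, X, y, eta=0.05, batch_size=16, num_steps=1_000, seed=0):
    rng = np.random.default_rng(seed)
    w = w0
    for _ in range(num_steps):
        idx = rng.choice(len(y), size=batch_size, replace=False)  # random mini-batch
        w = w - eta * grad_batch(w, X[idx], y[idx])               # cheap, noisy step
    return w

# Toy usage with the same least-squares stand-in objective as before.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(1_000, 5)), rng.normal(size=1_000)
grad_batch = lambda w, Xb, yb: 2.0 / len(yb) * Xb.T @ (Xb @ w - yb)
w_hat = sgd(grad_batch, w0=np.zeros(5), X=X, y=y)
```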

5 Instead of uploading the raw data, train a model locally and upload the model.

Addressing privacy: the model parameters will never contain more information than the raw training data. Addressing network overhead: the size of the model is generally smaller than the size of the raw training data. Each device trains a local ML model, and the server performs model aggregation.

Federated optimization characteristics (major challenges), illustrated by the toy partition sketch below:
Non-IID: the data generated by each user are quite different.
Unbalanced: some users produce significantly more data than others.
Massively distributed: the number of mobile device owners is much larger than the average number of training samples on each device.
Limited communication: unstable mobile network connections.

A new paradigm, federated learning: a synchronous update scheme that proceeds in rounds of communication. McMahan, H. Brendan, Eider Moore, Daniel Ramage, and Seth Hampson. "Communication-Efficient Learning of Deep Networks from Decentralized Data." AISTATS 2017.
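As a hedged illustration of the non-IID and unbalanced characteristics above, the sketch below simulates such a client partition with a Dirichlet split over class labels, a device common in later federated-learning experiments but not part of these slides; all names and parameters are assumptions.

```python
import numpy as np

# Illustrative sketch: simulate a non-IID, unbalanced partition of a labelled dataset
# across clients by drawing each class's client shares from a Dirichlet distribution.
# Smaller alpha -> more skewed label mixes (more non-IID) and more unbalanced sizes.

def dirichlet_partition(labels, num_clients, alpha=0.3, seed=0):
    rng = np.random.default_rng(seed)
    clients = {k: [] for k in range(num_clients)}
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        shares = rng.dirichlet(alpha * np.ones(num_clients))   # class c's split over clients
        cuts = (np.cumsum(shares)[:-1] * len(idx)).astype(int)
        for k, part in enumerate(np.split(idx, cuts)):
            clients[k].extend(part.tolist())
    return {k: np.array(v) for k, v in clients.items()}        # client k -> index set P_k

# Toy usage: 10 classes, 2,000 samples, 10 clients.
labels = np.random.default_rng(1).integers(0, 10, size=2_000)
P = dirichlet_partition(labels, num_clients=10)
print(sorted(len(v) for v in P.values()))                      # unbalanced n_k per client
```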

6 Federated learning overview. In round number i, each client holds its own local data and the central server holds the global model M(i). The server sends model M(i) to each client, and each client computes gradient updates for M(i) on its local data.

Still in round i, each client sends its updates of M(i) back to the central server, which performs model aggregation to produce M(i+1).

The central server then distributes the new global model M(i+1) to the clients.
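A skeleton of the round structure just described, written as a hedged sketch: client_update, the weighting by local dataset size, and plain NumPy weight vectors are placeholders assumed for illustration; the precise FedSGD and FedAvg rules follow on the next slides.

```python
import numpy as np

# Skeleton of one communication round: broadcast the global model, collect local updates,
# and aggregate them into the next global model. All names are placeholders.

def run_round(global_model, clients, client_update, aggregate):
    updates, weights = [], []
    for client_data in clients:                        # server broadcasts M(i) to each client
        local_model = client_update(np.copy(global_model), client_data)
        updates.append(local_model)                    # client sends its update of M(i)
        weights.append(len(client_data))               # n_k, used for weighted aggregation
    return aggregate(updates, weights)                 # server aggregates -> M(i+1)

# One possible aggregation rule: average updates weighted by local dataset size.
def weighted_average(updates, weights):
    w = np.asarray(weights, dtype=float) / sum(weights)
    return sum(wk * uk for wk, uk in zip(w, updates))

# Toy usage with a no-op local update on three clients of different sizes.
clients = [np.zeros((10, 3)), np.zeros((30, 3)), np.zeros((60, 3))]
new_model = run_round(np.zeros(3), clients, lambda m, d: m, weighted_average)
```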

7 Each client now holds model M(i+1), and round number i+1 proceeds in the same way.

Federated learning detail. For efficiency, at the beginning of each round a random fraction C of clients is selected, and the server sends the current model parameters to each of these clients.

Recall traditional deep learning model training: for a training dataset containing samples $(x_i, y_i)$, $1 \le i \le n$, the training objective is $\min_w f(w)$, where $f(w) = \frac{1}{n} \sum_{i=1}^{n} \ell_i(w)$ and $\ell_i(w) = \ell(x_i, y_i, w)$ is the loss of the prediction on example $(x_i, y_i)$. Deep learning optimization relies on SGD and its variants, applied through mini-batches: $w_{t+1} = w_t - \eta \nabla \ell(w_t; x_B, y_B)$.

In federated learning, suppose the $n$ training samples are distributed over $K$ clients, where $P_k$ is the set of indices of data points on client $k$, and $n_k = |P_k|$.

8 The training objective becomes $\min_w f(w)$, where

$f(w) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(w)$, with $F_k(w) = \frac{1}{n_k} \sum_{i \in P_k} \ell_i(w)$.

A baseline: FederatedSGD (FedSGD). A randomly selected client holding $n_k$ training samples in federated learning plays the role of a randomly selected sample in traditional deep learning. In FedSGD, a single step of gradient descent is done per round. Recall that in federated learning a C-fraction of clients is selected at each round: C = 1 corresponds to full-batch (non-stochastic) gradient descent, and C < 1 to stochastic gradient descent (SGD).

FedSGD (with clients fraction C = 1). Notation: learning rate $\eta$; total number of samples $n$; total number of clients $K$; number of samples on client $k$: $n_k$. In a round $t$: the central server broadcasts the current model $w_t$ to each client; each client $k$ computes the gradient $g_k = \nabla F_k(w_t)$ on its local data.
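A small numerical check (illustrative, not from the slides) that the federated objective above agrees with the centralized one: the weighted sum of the client objectives $F_k(w)$ equals the global average loss $f(w)$, even for unbalanced $n_k$.

```python
import numpy as np

# Check that f(w) = sum_k (n_k / n) * F_k(w) equals the centralized average
# (1/n) * sum_i l_i(w), using stand-in per-sample losses and an unbalanced partition.

rng = np.random.default_rng(0)
n, K = 1_000, 10
losses = rng.random(n)                                  # stand-in for per-sample losses l_i(w)
cuts = np.sort(rng.choice(np.arange(1, n), size=K - 1, replace=False))
P = np.split(rng.permutation(n), cuts)                  # unbalanced index sets P_k, n_k = len(P[k])

f_central = losses.mean()                               # (1/n) * sum_i l_i(w)
f_federated = sum(len(Pk) / n * losses[Pk].mean() for Pk in P)  # sum_k (n_k/n) * F_k(w)
assert np.isclose(f_central, f_federated)
```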

9 Approach 1: each client k submits $g_k$; the central server aggregates the gradients to generate the new model:

$w_{t+1} = w_t - \eta \nabla f(w_t) = w_t - \eta \sum_{k=1}^{K} \frac{n_k}{n} g_k$.

Approach 2: each client k computes $w_{t+1}^k = w_t - \eta g_k$; the central server performs the aggregation:

$w_{t+1} = \sum_{k=1}^{K} \frac{n_k}{n} w_{t+1}^k$.

Letting each client take this local step multiple times before aggregation gives FederatedAveraging (FedAvg). Recall $f(w) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(w)$.

How does federated learning deal with limited communication? Increase the computation per round: select more clients for training in each communication round, and increase the computation on each client.

FederatedAveraging (FedAvg). Notation: learning rate $\eta$; total number of samples $n$; total number of clients $K$; number of samples on client $k$: $n_k$; clients fraction $C$. In a round $t$: the central server broadcasts the current model $w_t$ to each client; each client $k$ computes the gradient $g_k = \nabla F_k(w_t)$ on its local data.
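The two approaches give the same result for a single step, since the weights $n_k / n$ sum to one; the sketch below checks this numerically with stand-in gradients (all values are illustrative assumptions, not from the slides).

```python
import numpy as np

# Check that the two FedSGD aggregation rules coincide for a single step:
# averaging gradients then stepping equals stepping locally then averaging models
# with weights n_k / n.

rng = np.random.default_rng(0)
K, d, eta = 5, 8, 0.1
w_t = rng.normal(size=d)
g = [rng.normal(size=d) for _ in range(K)]            # stand-ins for g_k = grad F_k(w_t)
n_k = rng.integers(10, 100, size=K)
weights = n_k / n_k.sum()

# Approach 1: server aggregates gradients, then applies one gradient step.
w_next_1 = w_t - eta * sum(p * gk for p, gk in zip(weights, g))

# Approach 2: each client steps locally, server averages the local models.
w_next_2 = sum(p * (w_t - eta * gk) for p, gk in zip(weights, g))

assert np.allclose(w_next_1, w_next_2)
```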

10 Approach 2, run for E local epochs: each client k repeatedly updates $w^k \leftarrow w^k - \eta \nabla F_k(w^k)$ for E epochs, and the central server then performs the aggregation

$w_{t+1} = \sum_{k=1}^{K} \frac{n_k}{n} w_{t+1}^k$.

If B is the local mini-batch size, the number of local updates on client k in each round is $u_k = E \, n_k / B$.

FedAvg, model initialization. Two choices: initialize on the central server, or independently on each client. The figure (loss on the full MNIST training set for models generated by mixing two models as $\theta w + (1-\theta) w'$) shows that shared initialization works better in practice.

FedAvg, model averaging. As shown in the same figure (loss on the full MNIST training set for models generated by $\theta w + (1-\theta) w'$), naive parameter averaging works surprisingly well in practice.

FedAvg summary: first, a model is randomly initialized on the central server; in each round, a random set of clients is chosen, each chosen client performs local gradient descent steps, and the server averages the resulting models.
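Putting the pieces together, here is a minimal FedAvg sketch under stated assumptions: a least-squares client loss, plain NumPy weight vectors, and hypothetical names; it is not the paper's implementation.

```python
import numpy as np

# FedAvg sketch: each round, a C-fraction of clients runs E local epochs of mini-batch SGD
# (batch size B) from the current global model; the server averages the results with
# weights n_k over the selected clients' total sample count.

def client_update(w, X, y, eta, E, B, rng):
    for _ in range(E):                                   # E local epochs
        for idx in np.array_split(rng.permutation(len(y)), max(1, len(y) // B)):
            grad = 2.0 / len(idx) * X[idx].T @ (X[idx] @ w - y[idx])
            w = w - eta * grad                           # one of the u_k = E*n_k/B local updates
    return w

def fed_avg(clients, d, rounds=50, C=0.5, eta=0.05, E=5, B=10, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(d)                                      # model initialized on the server
    for _ in range(rounds):
        m = max(1, int(C * len(clients)))                # select a C-fraction of clients
        chosen = rng.choice(len(clients), size=m, replace=False)
        n_sel = sum(len(clients[k][1]) for k in chosen)
        updates = [(len(clients[k][1]),
                    client_update(w.copy(), *clients[k], eta, E, B, rng)) for k in chosen]
        w = sum(n_k / n_sel * w_k for n_k, w_k in updates)   # weighted model averaging
    return w

# Toy usage: 10 clients with unbalanced local least-squares data.
rng = np.random.default_rng(1)
w_true = rng.normal(size=5)
clients = []
for _ in range(10):
    n_k = rng.integers(20, 200)
    Xk = rng.normal(size=(n_k, 5))
    clients.append((Xk, Xk @ w_true + 0.1 * rng.normal(size=n_k)))
w_hat = fed_avg(clients, d=5)
```

With E = 1, C = 1, and B at least as large as each local dataset, this sketch reduces to the FedSGD baseline above.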

