Aggregation Algorithms

Introduction

In each global epoch, the typical horizontal federated learning paradigm involves three stages:

the server broadcasts the global model to every client.
clients conduct model training with their local datasets, and send the model to the server after training.
the server gathers the local models and aggregates them to obtain a global model.

At the algorithm level, the third stage is the most critical part of federated learning, because it determines the training efficiency and stability of the global model. Many aggregation algorithms can combine the local models into a new global model. One of the standard aggregation methods is fedavg, which is widely used because of its simplicity and low communication cost. Extensive experiments have demonstrated the effectiveness and stability of fedavg, but several problems have also emerged:

convergence is not guaranteed for non-IID (heterogeneous) data.
training is very unstable when there are Byzantine clients (i.e. clients who intentionally upload wrong parameters).
support for heterogeneous devices, communication efficiency, tolerance to dropped clients and other aspects still need improvement.

Many published works propose replacements for fedavg that aim to improve the performance of model training. XFL implements various mainstream aggregation algorithms in a flexible, easy-to-use way, so data scientists can choose an appropriate aggregation algorithm for a specific federated learning task.

Aggregation Type

Different algorithms may suit different datasets and experimental conditions. fedavg usually performs well, but other algorithms are worth trying when training hits a bottleneck.

fedavg

fedavg [fedavg] is the basis of the other algorithms. It can be described as follows:

input: number of parties \(P\), global epoch \(T\), local epoch \(E\), loss function, optimizer

output: Single global model \(M^T\)

server executes:

for \(t=0,1,...,T-1\) do

broadcast \(M^t\) to all clients

for \(i=0,1,...,P-1\) in parallel do

conduct local training, upload local model \(m^i\) and aggregation weight \(w^i\)

\(M^{t+1} \leftarrow \frac{\sum_iw^im^i}{\sum_iw^i}\)

return \(M^T\)

party executes:

\(m^i \leftarrow M^t\)

for epoch \(k=0,1,...,E-1\) do

for each batch do

update parameters according to the loss function and optimizer

calculate the aggregation weight \(w^i\)

return \(m^i\) and \(w^i\) to the server

In the original fedavg, the optimizer is stochastic gradient descent (SGD), and the aggregation weight \(w^i\) is equal to the number of local batches. XFL provides a more flexible configuration: it accepts an arbitrary optimizer, such as Adam, and users can freely change the definition of the aggregation weights.
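The aggregation step \(M^{t+1} \leftarrow \frac{\sum_iw^im^i}{\sum_iw^i}\) can be illustrated with a minimal pure-Python sketch. The `Model` type and the function name `fedavg_aggregate` are hypothetical and chosen for illustration only; they are not part of the XFL API.

```python
from typing import Dict, List, Sequence

# Hypothetical representation: parameter name -> flat list of values.
Model = Dict[str, List[float]]


def fedavg_aggregate(models: Sequence[Model], weights: Sequence[float]) -> Model:
    """Return sum_i(w^i * m^i) / sum_i(w^i), computed per parameter entry."""
    if not models or len(models) != len(weights):
        raise ValueError("need exactly one aggregation weight per local model")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("aggregation weights must sum to a positive value")
    aggregated: Model = {}
    for name in models[0]:
        size = len(models[0][name])
        aggregated[name] = [
            sum(w * m[name][k] for m, w in zip(models, weights)) / total
            for k in range(size)
        ]
    return aggregated


# Two local models uploaded by two parties (toy values).
local_models = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 6.0], "b": [4.0]},
]

# Weights defined as the number of local batches: 1 and 3.
by_batches = fedavg_aggregate(local_models, [1, 3])
print(by_batches)  # {'w': [2.5, 5.0], 'b': [3.0]}

# A different weight definition: every party counts equally.
uniform = fedavg_aggregate(local_models, [1, 1])
print(uniform)  # {'w': [2.0, 4.0], 'b': [2.0]}
```

Changing the definition of \(w^i\) only changes the list passed as `weights`; the aggregation formula itself stays the same.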

fedprox

fedprox [fedprox] is implemented on top of fedavg and may improve training performance when the data is non-IID. For an arbitrary loss function \(L\), fedprox automatically adds a regularizer \(\frac{\mu}{2}||m^i-M^t||^2\), so the actual loss function becomes \(L + \frac{\mu}{2}||m^i-M^t||^2\). \(\mu\) is a hyperparameter that must be specified, for example:

"type": "fedprox",
"mu": 1,

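The effect of the fedprox regularizer \(\frac{\mu}{2}||m^i-M^t||^2\) can be sketched in a few lines of Python. This is an illustration of the formula only, on flattened parameter vectors; the function names are hypothetical and do not reflect how XFL implements fedprox internally.

```python
from typing import List, Sequence


def proximal_term(local: Sequence[float], global_: Sequence[float], mu: float) -> float:
    """(mu / 2) * ||m^i - M^t||^2 for flattened parameter vectors."""
    if len(local) != len(global_):
        raise ValueError("local and global models must have the same size")
    return 0.5 * mu * sum((a - b) ** 2 for a, b in zip(local, global_))


def fedprox_loss(task_loss: float, local: Sequence[float],
                 global_: Sequence[float], mu: float) -> float:
    """Original loss L plus the proximal regularizer."""
    return task_loss + proximal_term(local, global_, mu)


def proximal_gradient(local: Sequence[float], global_: Sequence[float],
                      mu: float) -> List[float]:
    """Gradient of the regularizer w.r.t. the local model: mu * (m^i - M^t)."""
    return [mu * (a - b) for a, b in zip(local, global_)]


m_i = [1.0, 2.0]   # local model after some updates (toy values)
M_t = [0.0, 0.0]   # global model received at the start of the round
loss = fedprox_loss(0.5, m_i, M_t, mu=1.0)
grad = proximal_gradient(m_i, M_t, mu=1.0)
print(loss)  # 3.0
print(grad)  # [1.0, 2.0]
```

The extra gradient `mu * (m^i - M^t)` pulls the local model back toward the global model, so a larger \(\mu\) limits how far each party can drift during local training.
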
scaffold

scaffold [scaffold] is implemented on top of fedavg and may improve training performance when the data is non-IID. scaffold uses its own optimizer, so any other optimizer supplied as input is ignored.
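A dedicated optimizer is plausibly needed because the local update in the SCAFFOLD paper [scaffold] corrects every gradient with control variates: \(m^i \leftarrow m^i - \eta\,(g - c^i + c)\), where \(c^i\) and \(c\) are the local and global control variates. The sketch below illustrates that single step as described in the paper, under the assumption of flattened parameter vectors; the function name and arguments are hypothetical and are not XFL's implementation.

```python
from typing import List, Sequence


def scaffold_local_step(params: Sequence[float], grads: Sequence[float],
                        c_local: Sequence[float], c_global: Sequence[float],
                        lr: float) -> List[float]:
    """One corrected step: p <- p - lr * (g - c_local + c_global)."""
    if not (len(params) == len(grads) == len(c_local) == len(c_global)):
        raise ValueError("all vectors must have the same length")
    return [
        p - lr * (g - ci + c)
        for p, g, ci, c in zip(params, grads, c_local, c_global)
    ]


# Toy values, chosen only to make the arithmetic easy to follow.
updated = scaffold_local_step(
    params=[1.0, 1.0],
    grads=[0.5, -0.5],
    c_local=[0.25, 0.0],
    c_global=[0.0, 0.25],
    lr=0.5,
)
print(updated)  # [0.875, 1.125]
```

When both control variates are zero, the step reduces to plain SGD.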

fedavg: McMahan B., Moore E., Ramage D. et al. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of AISTATS, pp. 1273-1282, 2017.
fedprox: Li T., Sahu A. K., Zaheer M. et al. Federated optimization in heterogeneous networks. In MLSys, 2020.
scaffold: Karimireddy S. P., Kale S., Mohri M. et al. SCAFFOLD: Stochastic controlled averaging for on-device federated learning. In Proceedings of the 37th International Conference on Machine Learning. PMLR, 2020.