Aggregation Algorithms
Introduction
In each global epoch, the typical horizontal federated learning paradigm involves three stages:
1. The server broadcasts the global model to every client.
2. Clients train the model on their local datasets and send the trained model to the server.
3. The server gathers the local models and aggregates them to obtain a new global model.
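The three stages can be sketched as a single function. This is a minimal illustration, not XFL's actual code: models are flattened to lists of floats, and `run_global_epoch`, `LocalTrainer`, `Aggregator`, and `simple_mean` are hypothetical names introduced only for this example.

```python
from typing import Callable, List, Sequence

Model = List[float]  # flattened parameters; a simplification for this sketch
LocalTrainer = Callable[[Model], Model]          # hypothetical client-side training hook
Aggregator = Callable[[Sequence[Model]], Model]  # hypothetical server-side aggregation hook


def run_global_epoch(global_model: Model,
                     clients: Sequence[LocalTrainer],
                     aggregate: Aggregator) -> Model:
    """Run one global epoch: broadcast, local training, aggregation."""
    local_models = []
    for train_locally in clients:
        # Stage 1: the server broadcasts (a copy of) the global model.
        received = list(global_model)
        # Stage 2: the client trains on its own data and returns its model.
        local_models.append(train_locally(received))
    # Stage 3: the server aggregates the local models into a new global model.
    return aggregate(local_models)


def simple_mean(models: Sequence[Model]) -> Model:
    """Unweighted parameter-wise mean, used here only as a placeholder aggregator."""
    return [sum(values) / len(models) for values in zip(*models)]


# Toy clients: each "training" step just shifts every parameter by a constant.
clients = [lambda m: [p + 1.0 for p in m], lambda m: [p + 3.0 for p in m]]
new_global = run_global_epoch([0.0, 10.0], clients, simple_mean)
```

Passing the aggregator in as a function mirrors the idea that only the third stage changes when one aggregation algorithm is swapped for another.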
At the algorithm level, the third stage is the most critical part of federated learning, because it determines the training efficiency and stability of the global model. Many aggregation algorithms can combine the local models into a new global model. The standard method is fedavg, which is widely used because of its simplicity and low communication cost. Extensive experiments have demonstrated the effectiveness and stability of fedavg, but several problems have also emerged:
- Convergence is not guaranteed for non-IID (heterogeneous) data.
- Training is very unstable when there are Byzantine clients (i.e., clients that intentionally upload wrong parameters).
- Support for heterogeneous devices, communication efficiency, tolerance to dropped clients, and other aspects still need improvement.
Many published works propose alternatives to fedavg that aim to improve training performance. XFL implements various mainstream aggregation algorithms in a flexible, easy-to-use way, so data scientists can choose an appropriate aggregation algorithm for a specific federated learning task.
Aggregation Type
Different algorithms may suit different datasets and experimental conditions. fedavg usually performs well, but other algorithms are worth trying when training hits a bottleneck.
fedavg
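fedavg [fedavg] is the basis of the other algorithms. In each global epoch, the server forms the new global model as a weighted average of the local models it receives, \(M^{t+1} = \sum_i w^i m^i / \sum_i w^i\), where \(m^i\) is the local model of client \(i\) and \(w^i\) is its aggregation weight.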
In the prototype of fedavg, the optimizer is stochastic gradient descent (SGD), and the aggregation weight \(w^i\) is equal to the number of local batches. XFL offers a more flexible configuration: it accepts an arbitrary optimizer, such as Adam, and lets users freely change the definition of the aggregation weights.
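To make the role of the aggregation weights concrete, the following sketch computes a weighted average of local models, normalized by the sum of the weights. It assumes flattened parameter lists; `fedavg_aggregate` is a hypothetical helper written for this illustration and is not part of the XFL API.

```python
from typing import List, Sequence


def fedavg_aggregate(local_models: Sequence[Sequence[float]],
                     weights: Sequence[float]) -> List[float]:
    """Weighted parameter-wise average of local models (illustrative only)."""
    if len(local_models) != len(weights):
        raise ValueError("one aggregation weight is required per local model")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("aggregation weights must sum to a positive value")
    num_params = len(local_models[0])
    return [
        sum(w * model[k] for model, w in zip(local_models, weights)) / total
        for k in range(num_params)
    ]


# Two clients; the second one ran three times as many local batches,
# so its model counts three times as much in the average.
aggregated = fedavg_aggregate([[0.0, 4.0], [4.0, 8.0]], weights=[1, 3])
```

Changing the definition of the weights (for example, using sample counts or equal weights instead of batch counts) only changes the `weights` argument; the averaging step stays the same.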
fedprox
fedprox [fedprox] builds on fedavg and may improve training performance when the data is non-IID. For an arbitrary loss function \(L\), fedprox automatically adds the regularizer \(\frac{\mu}{2}\|m^i-M^t\|^2\), which penalizes the distance between the local model \(m^i\) and the current global model \(M^t\). The actual loss function therefore becomes \(L + \frac{\mu}{2}\|m^i-M^t\|^2\). The hyperparameter \(\mu\) must be supplied in the configuration, for example:
"type": "fedprox",
"mu": 1,
scaffold
scaffold [scaffold] builds on fedavg and may improve training performance when the data is non-IID. scaffold uses its own optimizer, so any other optimizer given in the configuration is ignored.
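One way to see why scaffold needs its own optimizer is that its local step corrects each gradient with control variates. The sketch below follows the client update described in the SCAFFOLD paper [scaffold], including the paper's second option for refreshing the client control variate; it is an assumption-laden illustration rather than XFL's implementation. `scaffold_local_update`, `grad_fn`, `lr`, and `num_steps` are hypothetical names.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def scaffold_local_update(global_params: Sequence[float],
                          server_control: Sequence[float],
                          client_control: Sequence[float],
                          grad_fn: Callable[[Vector], Vector],
                          lr: float,
                          num_steps: int) -> Tuple[Vector, Vector]:
    """Run corrected local steps and return (local params, new client control)."""
    params = list(global_params)
    for _ in range(num_steps):
        grad = grad_fn(params)
        # Corrected step: y <- y - lr * (g(y) - c_i + c)
        params = [
            p - lr * (g - ci + c)
            for p, g, ci, c in zip(params, grad, client_control, server_control)
        ]
    # Control variate refresh: c_i_new = c_i - c + (x - y) / (num_steps * lr)
    new_client_control = [
        ci - c + (x - y) / (num_steps * lr)
        for ci, c, x, y in zip(client_control, server_control, global_params, params)
    ]
    return params, new_client_control


# Toy objective f(y) = 0.5 * y^2, whose gradient is y itself.
local_params, new_control = scaffold_local_update(
    global_params=[1.0],
    server_control=[0.0],
    client_control=[0.0],
    grad_fn=lambda y: list(y),
    lr=0.5,
    num_steps=1,
)
```

Because the correction term is applied inside every local step, a generic optimizer such as plain SGD or Adam cannot simply be substituted without changing the algorithm.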
- fedavg
McMahan B., Moore E., Ramage D. et al. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of AISTATS, pp. 1273-1282, 2017.
- fedprox
Li T., Sahu A. K., Zaheer M. et al. Federated optimization in heterogeneous networks. In MLSys, 2020.
- scaffold
Karimireddy S. P., Kale S., Mohri M. et al. SCAFFOLD: Stochastic controlled averaging for on-device federated learning. In Proceedings of the 37th International Conference on Machine Learning. PMLR, 2020.