paper14

Accelerated Federated Learning with Decoupled Adaptive Optimization
Jiayin Jin∗ 1 Jiaxiang Ren∗ 1 Yang Zhou 1 Lingjuan Lyu 2 Ji Liu 3 Dejing Dou 3 4
Abstract 2017a; Kairouz et al., 2021). In a FL model, end clients

The federated learning (FL) framework enables (e.g., Android phones) use their local data to train a local
edge clients to collaboratively learn a shared in- ML model, while keeping the local data decentralized. The
ference model while keeping privacy of training end clients send the parameter updates of local models (e.g.,
data on clients. Recently, many heuristics efforts Android phone updates) rather than raw data to the central
have been made to generalize centralized adap- server (e.g., Android cloud). The server produces a shared
tive optimization methods, such as SGDM, Adam, global ML model by aggregating the local updates.
AdaGrad, etc., to federated settings for improving One of critical challenges in the FL paradigm is ex-
convergence and accuracy. However, there is still pensive communication cost between the clients and
a paucity of theoretical principles on where to and server (Konečný et al., 2016b; Caldas et al., 2018a; Liu
how to design and utilize adaptive optimization et al., 2019; Hamer et al., 2020a; He et al., 2020). Tra-
methods in federated settings. This work aims to ditional federated optimization methods, such as FedAvg
develop novel adaptive optimization methods for and its variants (Konečný et al., 2016b;a; McMahan et al.,
FL from the perspective of dynamics of ordinary 2017a; Kairouz et al., 2021), use local client updates, where
differential equations (ODEs). First, an analytic the clients perform multiple epochs of stochastic gradient
framework is established to build a connection descent (SGD) on their local datasets to update their models
between federated optimization methods and de- before communicating to the server. This can dramatically
compositions of ODEs of corresponding central- reduce the communication required to train a FL model.
ized optimizers. Second, based on this analytic
framework, a momentum decoupling adaptive op- Despite achieving remarkable performance, FL techniques
timization method, F ED DA, is developed to fully often suffer from two key challenging issues: (1) Client
utilize the global momentum on each local itera- drift. Too many SGD epochs on the same clients may result
tion and accelerate the training convergence. Last in overfitting to their local datasets, such that the local mod-
but not least, full batch gradients are utilized to els are far away from globally optimal models and the FL
mimic centralized optimization in the end of the training achieves slower convergence (Karimireddy et al.,
training process to ensure the convergence and 2020d; Woodworth et al., 2020; Reddi et al., 2021a; Karim-
overcome the possible inconsistency caused by ireddy et al., 2021); and (2) lack of adaptivity. Standard
adaptive optimization methods. SGD optimization methods used in most FL models may be
unsuitable for federated settings and result in high commu-
nication costs (Zhang et al., 2019; Reddi et al., 2021a).
1. Introduction In centralized ML models, adaptive optimization methods,
Recent advances in federated learning (FL) present a promis- such as SGD with Momentum (SGDM) (Rumelhart et al.,
ing machine learning (ML) paradigm that enables collab- 1986; Qian, 1999; Sutskever et al., 2013), Adaptive Moment
orative training of shared ML models over multiple dis- Estimation (Adam) (Kingma & Ba, 2015), Adaptive Gra-
tributed devices without the need for data sharing, while dient (AdaGrad) (McMahan & Streeter, 2010; Duchi et al.,
mitigating the data isolation as well as protecting the data 2011b), etc., have achieved superior success in speeding up
privacy (Konečný et al., 2016b;a; McMahan et al., 2016; the training convergence. Adaptive optimization methods
are designed to control possible deviations of mini-batch
*
Equal contribution 1 Auburn University, USA 2 Sony AI, Japan gradients in centralized models. Several recent studies try
3
Baidu Research, China 4 University of Oregon, USA. Correspon- to adapt centralized optimization algorithms to the feder-
dence to: Yang Zhou <[email protected]>, Lingjuan Lyu ated learning settings for achieving faster convergence and
<[email protected]>.
higher test accuracy (Xie et al., 2019; Reddi et al., 2021a;
Proceedings of the 39 th International Conference on Machine Karimireddy et al., 2021; Wang et al., 2021b).
Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copy-
right 2022 by the author(s). The above adaptive optimization methods can be broadly
classified into two categories: (1) Server adaptive meth- This work aims to develop novel adaptive optimization
ods. FedOpt is a generalization of FedAvg that allows the methods for FL from the perspective of the decomposi-
clients and the server to choose different optimization meth- tion of ODEs of centralized optimizers. We establish an
ods (Reddi et al., 2021a). Although FedOPT is a generic analytic framework to connect the federated optimization
framework that can use adaptive optimizers as client or methods with the decompositions of ODEs of correspond-
server optimizers, the core section (Section 3) of the Fe- ing centralized optimizers. The centralized gradient descent
dOPT paper only considers the setting that ClientOPT is (CGD) is utilized as a warm-up example to briefly illus-
SGD, which fails to make full use of the strength of the trate our underlying idea. The CGD reads W (t + 1) =
PM i
adaptive methods for improving the convergence (Wang W (t) − i=1 NN Li (W (t)) ∗ η, where W (t) is the global
et al., 2021b); (2) Client adaptive techniques. Local AdaAl- parameter, M is the number of clients,PN i is the number
ter proposes a novel SGD variant based on AdaGrad, and M
of training samples on ith device, N = i N i is the total
adopt the concept of local SGD to reduce the communi- i
number of training samples, L is the loss function on ith
i
cation (Xie et al., 2019). Mime uses a combination of PM
client. It is straightforward to check that i=1 NN Li (W (t))
control-variates and server-level optimizer state (e.g. mo- is the total loss of centralized training. Therefore, the CGD
mentum) at every client-update step to ensure that each local is the numerical solution of an ODE system dτ d
W (τ ) =
update mimics that of the centralized method run on i.i.d. PM N i i
− i=1 N L (W (τ )). One way to decompose the ODE
data (Karimireddy et al., 2021). However, Local AdaAl- PM N i i
ter and Mime use the same optimizer states (momentums system is as follows: W̄ (τ ) = i=1 N W (τ ), where
d
and pre-conditioners) on all clients, which is similar to W i (τ ) solves dτ W i (τ ) = −Li (W i (τ )), i = 1, · · · M .
server adaptive methods in FedOpt and does not exploit Thus, W̄ (τ ) is an approximate solution to the above ODE
local adaptivity (Wang et al., 2021b). FedLocal is the first system. The numerical solution of the decomposed sys-
real client adaptive approach that employs adaptive opti- tem is W i (t + 1) = W i (t) − ∇Li (W i (t)) ∗ η, and then
PM i
mization methods for local updates at clients (Wang et al., W̄ (t + 1) = i=1 NN W i (t) is an approximate numerical
2021b). It restarts the update of client optimizer states (i.e., solution to the ODE system of CGD. This scheme can be
momentums and pre-conditioners) at the beginning of each extended to the SGD case by replacing full batch gradients
round. This restart operation fails to inherit and aggregate with mini-batch gradients, which is exactly the update rule
the states from previous rounds and may loss the strength of FedAvg (McMahan et al., 2017a). This example illus-
of adaptive optimization methods in centralized models, trates how to derive federated optimization methods based
which aggregate the states from previous rounds to speed on the decomposition of ODEs of centralized optimizers.
up the convergence. For the standard FL models, applying Moreover, this example also demonstrates the rationality
adaptive optimization methods to local updates may come of FedAvg, since FedAvg is an approximation of the above
with the cost of a non-vanishing solution bias: instead of ODE system of CGD and the convergence of the CGD guar-
using current gradient to update local parameters in the antees the convergence of FedAvg if the approximation error
standard SGD, the adaptive methods utilize the aggrega- is well controlled. In the same spirit, to design adaptive op-
tion of client optimizer states (i.e., momentum aggregates timization methods for FL, it is crucial to discover suitable
previous and past gradients) in multiple client-update steps decompositions of ODEs of centralized adaptive optimizers.
to update local parameters. In this case, the heterogene-
Based on the above analytic framework, we develop a mo-
ity property in federated settings results in large deviations
mentum decoupling adaptive optimization method for FL.
among uploaded local gradients as well as makes the con-
By decoupling the global momentum from local updates, the
verge point far away from the global minimizer and slow
equation of global momentum becomes a linear equation.
down the convergence, compared with the standard SGD.
Consequently, we can decompose the equation of global
FedLocal proposes correction techniques to overcome this
momentum exactly and distribute the update of global mo-
bias issue and to complement the local adaptive methods for
mentum to local devices. The aggregation of the portions of
FL. However, the correction techniques that are essentially
the global momentum on local devices is exactly equal to
heuristic post-processing methods without knowledge of
the global momentum without any deviation. Particularly, in
previous client optimizer states on the server are hard to
our FedDA method: (1) the global momentums are updated
solve this problem fundamentally for stochastic optimiz-
with local gradients in each local iteration; (2) all global mo-
ers, such as SGDM, Adam, AdaGrad, etc. Therefore, the
mentums updated through local iterations will attend global
above techniques often follow manually-crafted heuristics
training to update global model. This is an analogy to cen-
to generalize centralized adaptive optimization methods to
tralized training, which fully utilizes the global momentum.
the federated settings. There is still a paucity of theoret-
Notice that momentums can provide fast convergence and
ical principles on where to and how to design and utilize
high accuracy for centralized training. The global momen-
adaptive optimization methods in federated settings.
tum in our FedDA method makes the best effort to mimic
the role of momentum in centralized training, which can 3. Decomposition of ODEs and FL
accelerate the convergence of FL training. We theoretically
demonstrate that (1) local momentum deviates from the cen- In this section, we establish the connection between the
tralized one at exponential rate; (2) global momentum in decomposition of ODEs of centralized optimizers and FL.
FedDA deviates from the centralized one at algebraic rate. As a warm up, we start with the most simple centralized
optimizer Gradient Descent. In particular, we demonstrate
Adaptive optimization methods for FL are often faced with that relation between the decomposition of the ODE for GD
convergence inconsistency in training (Wang et al., 2021b). and FedAvg, and explain why FedAvg works from ODE
We propose to utilize full batch gradients of clients to mimic theory.
centralized optimization in the end of the training process to
ensure the convergence. Based on our proposed framework Centralized training and FL share the same goal, that is to
of decomposing ODEs, if local devices only iterate once minimize the total loss L.
with full batch gradients in a training round, then our FedDA
M
method is in agreement with centralized training in that X Ni
round. Take the advantage of this, by reducing the iteration L(W ) = Li (W ). (2)
i=1
N
number to 1 in the end of training, our FedDA method
is able to mimic centralized training, which ensures the The most classical centralized optimization method is Gra-
convergence and overcome the inconsistency. dient Descent (GD):
Empirical evaluation on real federated tasks and datasets
demonstrates the superior performance of our momentum W (t + 1) = W (t) − ∇L(W (t)) ∗ η. (3)
decoupling adaptive optimization model against several
A more popular optimization is Stochastic Gradient Descent
state-of-the-art regular federated learning and federated opti-
(SGD). It updates the model with the gradient of the loss
mization approaches. In addition, more experiments, imple-
of a mini-batch at each step, which accelerates the training
mentation details, and hyperparameter selection and setting
process comparing to GD. Denote the loss of mini-batch
are presented in Appendices A.4-A.5.
at step t by Lt , then the training process of SGD can be
expressed by
2. Preliminaries and Notations
Federated learning aims to solve the following optimization W (t + 1) = W (t) − ∇Lt (W (t)) ∗ η. (4)
problem.
From the point of view of ODEs, the training process of GD
in Eq.(3) is the numerical solution of the autonomous ODE
M
X Ni d
min L(W ) = Li (W ) W (τ ) = −∇L W (τ ) ≤ 0. (5)
W ∈Rd N dτ
i=1
(1)
1 X The training process of SGD in Eq.(4) is as the numerical
where Li (W ) = lk (W ) solution of the non-autonomous ODE
Ni
k∈Pi
d
W (τ ) = −∇Lτ W (τ ) .

(6)
where lk (W ) = l(xk , yk ; W ) denotes the loss of the pre- dτ
diction on example (xk , yk ) made with model parameters
W . There are M clients over which the data is partitioned, In fact, Eq. (5) is a gradient system, i.e., the vector field
with Pi the set of indexes of data points on client S i , with of the equation −∇L W (τ ) is in the gradient form. As a
Ni = |Pi |. W i , Li , Gi , g i , and N i denote the local model gradient system, it is typical that the loss L descends along
parameter, loss function, full batch gradient, mini-batch gra- the solutions until it reaches local minimum. This is because
dient, number of training data on client S i respectively. N is
d
the total number of training data, i.e., N = N 1 + · · · + N M L W (τ ) = −|∇L W (τ ) |2 .

(7)
and η represents the learning rate. In each round, there are dτ
K clients participating in the training (K ≤ M ). The decentralization of the training process onto local de-
This work aims to derive the theoretical principle on where vices in FL is an analogy to the decomposition of Eq.(5)
to and how to design and utilize adaptive optimization meth- into a system of ODEs. Theoretically, a precise decompo-
PM i
ods in federated settings. Based on the theoretical principle, sition of Eq.(5) is W (τ ) = i=1 NN W i (τ ), where W i (τ )
it tries to propose an effective and efficient framework to satisfies
generalize adaptive optimization methods in centralized set- d i
W (τ ) = −∇Li W (τ ) , i = 1, · · · , M.

tings to FL with fast convergence and high accuracy. (8)
dτ
Numerical solutions to the above decomposed ODE system Comparing the above equation with Eq.(5), one has
is M
d X Ni
∇Li (W (τ ))

W (τ ) − W̄ (τ ) = −
W i (t + 1) = W i (t) − ∇Li W (t) ∗ η, i = 1, · · · , M.

dτ i=1
N (15)
(9a)
− ∇Li (W i (τ ) .

M
X Ni It is clear W (0) = W̄ (0). Integrating Eq.(15) from 0 to tη,
W (t + 1) = W i (t). (9b)
N it arrives
i=1
W (tη) − W̄ (tη)
Eq.(9a) is the local update rule and Eq.(9b) is the global
Z tη XM
aggregation rule. These update rules guarantees the training Ni (16)
∇Li (W (τ )) − ∇Li (W i (τ ) dτ.

process is equivalent to centralized GD. When local devices =−
0 i=1
N
are updated with mini-batch gradients, i.e.,
By a very rough estimate, we have that for t = 1, · · · , T
W i (t + 1) = W i (t) − g i W (t) ∗ η,

(10)
|W (tη) − W̄ (tη)| ≤ 2supk∇Li kL∞ tη, (17)
i
then this framework agrees with centralized SGD. Moreover,
if only part of the devices participate the training (either full which yields that W̄ (τ ) is a good approximation of W (τ )
batch or mini-batch) each round and aggregate after one iter- for τ ≤ T η when T η is small. Therefore, FL framework
ation as above, then it is also indistinguishable to centralized has to aggregate local models periodically and reset all
SGD. These optimization methods are essentially identical local and global parameters to be identical. By doing so,
to centralized optimization though they are in FL framework, the aggregations of local models are always stay close to
which theoretically are able to produce a global model as the centralized one. Since the centralized GD and SGD
good as centralized one. However, in Eq.(9a) ∇Li must be converge to local minimum of the total loss function, so
evaluate at W (t), i.e, global parameters W (t) are necessary does the one for FL. This method that performs SGD or GD
for the local update W i (t + 1) at any step t. Therefore, local on local devices and aggregates periodically is exactly the
models have to aggregate after each iteration, which will FedAvg. The above arguments also yields the underlying
cause expensive communication costs and poor efficiency. mechanism of FedAvg, which agrees with the essential idea
A redemption is to trade some accuracy for efficiency. To of the proof of convergence of FedAvg.
reduce communication cost, more local iterations have to Notice that the purpose of Eqs.(16) and (17) is to illustrate
be operated each round. Accordingly, Eq. (9a) is modified the idea of how to connect decomposition of ODEs to FL
to following approximate decomposition of Eq.(5) so that optimizations. We only used rough estimates instead of
more local iterations are admited. sharp ones to illustrate the underlying ideas more straight-
forwardly. Our FedDA method has never used the estimate
d i
W (τ ) = −∇Li W i (τ ) , in Eqs.(16) and (17) for attending any computations.

dτ (11)
W i (0) = W (0), i = 1, · · · , M. The superb performance of adaptive optimization methods,
such as SGDM, Adam, AdaGrad, have been demonstrated
Let T denote local iteration number. The corresponding for centralized training. Naturally, successful extensions of
numerical solution of Eq.(11) is these adaptive methods to FL are mostly desired.
W i (t+1) = W i (t)−∇Li (W i (t))∗η, t = 0, 1, · · · , T −1,

4. Momentum Decoupling Adaptive
(12)
which is exactly GD for S i . To further accelerate local Optimization: FedDA+SGDM
training, one may perform SGD for local devices, i.e., Recall the centralized SGDM.
W i (t + 1) = W i (t) − g i (W i (t)) ∗ η, t = 0, 1, · · · , T − 1. m(t + 1) = β ∗ m(t) + (1 − β) ∗ g(W (t)),
(18)
(13) W (t + 1) = W (t) − η ∗ m(t + 1),
The cost of this method is that the aggregation of W i where m is momentum, W is the parameter vector, g is
in Eq.(11) is not the solution to Eq.(5). Let W̄ (τ ) = the mini-batch gradient and η is the learning rate . The
PM N i i
i=1 N W (τ ) be the aggregation of the solution to corresponding ODE is a slow-fast system
Eq.(11). It is straightforward to check that
d
η m(τ ) = −(1 − β)m(τ ) + (1 − β)g(W (τ )),
M dτ
d X Ni (19)
W̄ (τ ) = − ∇Li (W i (τ )). (14) d
dτ N W (τ ) = −m(τ ).
i=1 dτ
d
m is a fast variable cause its derivative dτ m(τ ) is O(1/η) local training directly, it buffers local gradients generated in
and η is small. Therefore the rate of change of m is much each local iteration. We propose the following decoupled
faster than that of W . It is easy to check Eq.(18) is the decomposition of Eq.(19).
numerical solution of Eq.(19) using Euler’s scheme.
Similar to GD case, the numerical solution of a precise
d i
decomposition of SGDM is W (τ ) = −g i (W i (τ )),
dτ
M
mi (t + 1) = β ∗ mi (t) + (1 − β) ∗ g i (W (t)), d X Ni i
η m(τ ) = −(1 − β)m(τ ) + (1 − β) g (W i (τ ))
W i (t + 1) = W i (t) − η ∗ mi (t + 1), dτ i=1
N
M
Ni d
W (τ ) = −α ∗ m(τ ),
X
m(t + 1) = mi (t), (20) dτ
N
i=1 (22)
M i
X N
W (t + 1) = W i (t). where W i is the parameters of local devices, m is the global
i=1
N momentum, W is the parameters of global model and α is a
parameter to concede that local and global employ different
Again, the drawback is the local iteration number must be 1.
learning rates. Note that in Eq.(22), the local update of W i
To allow more local iterations, the most naive approach is
is independent of the global momentum m, this is why we
to perform SGDM on local devices directly
name this method decoupling method. The crucial point
mi (t + 1) = β ∗ mi (t) + (1 − β) ∗ g i (W i (t)), of our decoupling method is that since the the equations
(21) of W i (τ ) are independent of the global momentum m, the
W i (t + 1) = W i (t) − η ∗ mi (t + 1). equation of m is totally linear and can be precisely decom-
PM i
Then aggregate the above local updates periodically. This posed as m(τ ) = i=1 NN mi (τ ), where mi τ solves
method is indeed an approximation of centralized SGDM for d i
the same reason as in FedAvg case, and therefore this naive η m (τ ) = −(1 − β)mi (τ ) + (1 − β)∇g i (W i (τ )).
dτ
method may converge. However, the performance of this
naive method is not satisfying in experiments. The reasons Therefore, the system in Eq.(22) is exactly equivalent to
are twofold. First, the momentum is used to buff the devi-
ation of gradients. Decomposing the global momentum m d i
W (τ ) = −g i (W i (τ )),
onto local devices as in Eq.(21) will incapacitate it because dτ
PM N i i d
i=1 N m (t) is deviated from the momentum of central- η mi (τ ) = −(1 − β)mi (τ ) + (1 − β)g i (W i (τ )),
ized optimizer. Second, since the aggregated momentum is dτ
not accurate enough, it may also harm the local training if d
W (τ ) = −α ∗ m(τ ).
it is inherited by the local devices. This is why the restart dτ
(23)
local adaptive method works better (Wang et al., 2021b).
Therefore, though the Eq.(21) could be an approximation of That is we can calculate the global momentum m by pre-
centralized SGDM, it may need a rather small learning rate cisely decomposing it to local devices and the aggregation
to ensure the quality of approximation, which results in un- of the portion of the global momentum on local devices
satisfactory training performance. Local training generates mi will be exactly the global momentum. The numerical
local gradients deviating from the global ones, therefore a solution to Eq.(23) is
global momentum is favorable to control the deviations of
the gradients uploaded by local devices. FedOpt updates W i (t + 1) = W i (t) − g i (W i (t)) ∗ η, (24)
global momentum one time after the server receiving local i i i i
m (t + 1) = β ∗ m (t) + (1 − β) ∗ g (W (t))η, (25)
gradients generated by several local iterations each round.
M
Though it achieves great performance, it does not fully take X Ni i
the advantage of global momentum. In FedOpt the global m(t + 1) = m (t), (26)
i=1
N
momentum buffs the total gradients of multiple iterations.
However, for centralized SGDM, the momentum buffs the W (t + 1) = W (t) − m(t + 1) ∗ α ∗ η. (27)
gradient in each iteration. To let the global momentum fully
plays its role and accelerate the training of the global model, This numerical solution demonstrates our underlying ideas
we propose a decoupling adaptive technique (FedDA) to better. Eq.(24) yields that local update does not depend on
approximately mimic centralized SGDM. In our FedDA the global momentum m. However, the global momentum
method, though the global momentum does not participate buffs local gradients in each local iteration as shown in
Table 1: Statistics of the Datasets
Dataset #Training Clients #Test Clients #Training Samples #Test Samples

CIFAR-100 500 100 50,000 10,000
EMNIST-62 3,400 3,400 671,585 77,483
Stack Overflow 342,477 204,088 135,818,730 16,586,035
Eq.(26). Based on the numerical solution Eqs.(24)-(27), the Our proposed momentum decoupling adaptive optimization
algorithm of our FedDA+SGDM method is as follows. method has potential be extended to other federated learning
algorithms, e.g., FedProx, etc. Due to space limit, we only
At round E, pick participating clients S 1 , · · · , S k . For each
use FedDA+SGDM as an example to show how to apply
S i , i = 1, · · · , K, initialize P i = 0, W i (0) = W (E),
the FedDA optimization to FedProx. At round E, pick par-
mi (0) = m(E). For t = 0, T − 1,
ticipating clients S 1 , · · · , S k . For each S i , i = 1, · · · , K,
initialize P i = 0, W i (0) = W (E), mi (0) = m(E). For
W i (t + 1) = W i (t) − g i (W i (t)) ∗ η,
t = 0, · · · , T − 1, W i (t + 1) = W i (t) − g i (W i (t)) + µ ∗
mi (t + 1) = βmi (t) + (1 − β)g i (W i (t)), (28) (W i (t) − W (E) ∗ η, mi (t + 1) = β ∗ mi (t) + (1 − β) ∗
i i i
P = P + m (t + 1). g i (W i (t)) + µ ∗ (W i (t) − W (E) , P i = P i + mi (t + 1).
The global update rule is the same as the one in Eq.(29). By
The global update rule is: following the similar strategy, other two versions of FedDA
(FedDA+ADAM and FedDA+AdaGrad) can be added to
P = aggregation of P i , FedProx and other regular FL algorithms, which demon-
m(E + 1) = aggregation of mi (T ), (29) strates the applicability and generality of FedDA.
W (E + 1) = W (E) − P ∗ α ∗ η.
5. Experiments
Notice that P is the summation of global momentums up- In this section, we have evaluated the performance of our
dated during local iterations, and global parameter W is FedDA model and other comparison methods in serval rep-
updated by P . This mimics the centralized SGDM. We resentative federated datasets and learning tasks to date. We
theoretically demonstrate that (1) local momentum deviates show that FedDA with decoupling and full batch gradient
from the centralized one at exponential rate O(eλt ); and (2) techniques is able to achieve faster convergence and higher
global momentum in FedDA deviates from the centralized test accuracy in cross-device settings against several state-
one at algebraic rate O(t2 ). Please refer to Appendix A.3 of-the-art federated optimization methods. The experiments
for detailed proof. exactly follow the same settings described by a recent feder-
ated optimization method, FedOpt (Reddi et al., 2021a).
Full batch gradient stabilization. Adaptive optimization
methods for FL are often faced with inconsistency of conver- Datasets. We focus on three popular computer vision and
gence in training (Wang et al., 2021b). When the training natural language processing tasks over three representative
is almost finished, i.e, W is close to the local minimum benchmark datasets respectively: (1) image classification
point of loss function, the gradient has to be more accurate over CIFAR-100 (Krizhevsky, 2009). We train ResNet-18
to ensure the convergence to local minimum point. Not by replacing batch norm with group norm (Hsieh et al.,
sufficiently accurate gradients will lead to unstable conver- 2020); (2) image classification over EMNIST (Hsieh et al.,
gence and inconsistency. Based on our framework, if local 2020). We train a CNN for character recognition; and (3)
devices only iterate once with full batch gradients in a train- text classification over Stack Overflow (TensorFlow, 2019).
ing round, then our FedDA method is in agreement with We perform tag prediction using logistic regression on bag-
centralized training in that round. Therefore, by reducing of-words vectors. We select the 10,000 most frequently
the iteration number to 1 in the end of training, our FedDA used words, the 500 most frequent tags and a one-versus-
method mimics centralized training, which ensures the con- rest classification strategy, by following the same strategy
vergence and overcomes the inconsistency of training. Last in FedOpt (Reddi et al., 2021a). The detailed descriptions
but not least, since this full batch method is utilized only of the federated datasets and learning tasks are presented in
in the end of the training when the parameter W is fairly Appendix A.5.
close to local minimum point, the training converges and Baselines. We compare the FedDA model with nine state-
stabilized rather quickly. Therefore, it is a bargain to employ of-the-art federated learning models, including five regu-
this full batch method in the end of the training to guarantee lar federated learning and four federated optimization ap-
the convergence of the training to local minimum point.
0.5 5.5 16 5.5

FedLocal 0.5 FedLocal FedLocal FedLocal FedLocal
0.5 SCAFFOLD SCAFFOLD SCAFFOLD SCAFFOLD SCAFFOLD
FedLin FedLin FedLin 5.0 FedLin 14 FedLin 5.0
FedAvg FedAvg 0.4 FedAvg FedAvg FedAvg
0.4 MFL 0.4 MFL MFL MFL MFL
4.5 12 4.5
CLIMB CLIMB CLIMB CLIMB CLIMB
STEM STEM 0.3 STEM STEM STEM
0.3 4.0 10 4.0
0.3 MimeLite MimeLite MimeLite MimeLite MimeLite FedLocal
FedOpt FedOpt FedOpt FedOpt FedOpt SCAFFOLD
FedDA FedDA FedDA 3.5 FedDA 8 FedDA 3.5 FedLin
0.2 0.2 0.2 FedAvg
3.0 6 3.0 MFL
CLIMB
0.1 0.1 0.1 2.5 STEM
4 2.5 MimeLite
FedOpt
2.0 2 2.0 FedDA
0.0 0.0 0.0
0 500 1000 1500 2000 2500 3000 3500 4000 0 500 1000 1500 2000 2500 3000 3500 4000 0 500 1000 1500 2000 2500 3000 3500 4000 0 500 1000 1500 2000 2500 3000 3500 4000 0 500 1000 1500 2000 2500 3000 3500 4000 0 500 1000 1500 2000 2500 3000 3500 4000
(a) SGDM (b) Adam (c) AdaGrad (a) SGDM (b) Adam (c) AdaGrad
Figure 1: Convergence on CIFAR-100 with Three Optimizers Figure 2: Loss on CIFAR-100 with Three Optimizers
16 FedLocal 16 FedLocal 16
0.8 0.8 0.8 SCAFFOLD SCAFFOLD
14 FedLin 14 FedLin 14
FedAvg FedAvg
FedLocal FedLocal FedLocal 12 MFL 12 MFL 12 FedLocal
0.6 SCAFFOLD 0.6 SCAFFOLD 0.6 SCAFFOLD CLIMB CLIMB SCAFFOLD
FedLin FedLin FedLin 10 STEM 10 STEM 10 FedLin
FedAvg FedAvg FedAvg MimeLite MimeLite FedAvg
MFL MFL MFL FedOpt FedOpt MFL
CLIMB CLIMB CLIMB 8 8 8 CLIMB
0.4 0.4 0.4 FedDA FedDA
STEM STEM STEM STEM
MimeLite MimeLite MimeLite 6 6 6 MimeLite
FedOpt FedOpt FedOpt FedOpt
0.2 FedDA 0.2 FedDA 0.2 FedDA 4 4 4 FedDA
2 2 2
0.0 0.0 0.0 0 0 0
0 200 400 600 800 1000 1200 1400 0 200 400 600 800 1000 1200 1400 0 200 400 600 800 1000 1200 1400 0 200 400 600 800 1000 1200 1400 0 200 400 600 800 1000 1200 1400 0 200 400 600 800 1000 1200 1400
(a) SGDM (b) Adam (c) AdaGrad (a) SGDM (b) Adam (c) AdaGrad
Figure 3: Convergence on EMNIST with Three Optimizers Figure 4: Loss on EMNIST with Three Optimizers
proaches. FedAvg is a classical as well as practical method Local) proposes techniques that enable the use of adaptive
for the federated learning of deep networks based on iter- optimization methods for local updates at clients in feder-
ative model averaging (McMahan et al., 2017a). SCAF- ated learning (Wang et al., 2021b). To our best knowledge,
FOLD is a algorithm which uses control variates to correct this work is the first to certify the group fairness of classi-
for the client-drift in its local updates (Karimireddy et al., fiers with theoretical input-agnostic guarantees, while there
2020d). FedLin is an algorithmic framework to tackle the is no need to know the shift between training and deploy-
key challenges of objective heterogeneity, systems hetero- ment datasets with respect to sensitive attributes. All nine
geneity, and imprecise communication in FL (Mitra et al., baselines used in our experiments either do not use momen-
2021a). STEM is a stochastic two-sided momentum algo- tum, or use momentum without momentum aggregation,
rithm, that utilizes certain momentum-assisted stochastic or use momentum with aggregation but restart momentum
gradient directions for both the client and server updates (Mi- at each FL round (FedLocal). Our FedDA method keeps
tra et al., 2021a). CLIMB is an agnostic constrained learn- the momentum aggregation in the entire FL process, which
ing formulation to tackle the class imbalance problem in results in faster convergence but larger oscillation.
FL without requiring further information beyond the stan-
Evaluation metrics. We use two popular measures in fed-
dard FL objective (?). MFL performs momentum gradi-
erated learning and plot the measure curves with increasing
ent descent in local update step of FL system to solve the
training rounds to verify the convergence of different meth-
distributed machine learning problem (Liu et al., 2020a).
ods: Accuracy and Loss Function Values (Loss) (Karim-
FedOpt is a flexible algorithmic framework that allows the
ireddy et al., 2020d; Mitra et al., 2021a; Liu et al., 2020a;
clients and the server to choose different optimization meth-
Reddi et al., 2021a; Karimireddy et al., 2021; Wang et al.,
ods more general than stochastic gradient descent (SGD)
2021b). A larger Accuracy or a smaller Loss score indicates
in FedAvg (Reddi et al., 2021a). MimeLite uses a combi-
a better federated learning result. We run 1,500 rounds of
nation of control-variates and server-level optimizer state
training on the EMNIST and Stack Overflow datasets, and
at every client-update step to ensure that each local up-
4,000 rounds over the CIFAR-100 dataset. In addition, we
date mimics that of the centralized method run on i.i.d.
use final Accuracy to evaluate the quality of the federated
data (Karimireddy et al., 2021). Local Adaptivity (Fed-
learning algorithms.
Table 2: Final Accuracy on CIFAR-100 Table 3: Final Accuracy on EMNIST
Optimizer SGDM Adam AdaGrad Optimizer SGDM Adam AdaGrad

FedLocal 0.384 0.009 0.113 FedLocal 0.834 0.055 0.806
SCAFFOLD 0.010 0.010 0.010 SCAFFOLD 0.794 0.794 0.794
FedLin 0.440 0.440 0.440 FedLin 0.805 0.805 0.805
FedAvg 0.324 0.324 0.324 FedAvg 0.850 0.850 0.850
MFL 0.293 0.346 0.135 MFL 0.848 0.055 0.047
CLIMB 0.010 0.010 0.010 CLIMB 0.843 0.843 0.843
STEM 0.014 0.014 0.014 STEM 0.051 0.051 0.051
MimeLite 0.427 0.009 0.009 MimeLite 0.835 0.851 0.821
FedOpt 0.425 0.443 0.301 FedOpt 0.838 0.847 0.840
FedDA 0.518 0.510 0.488 FedDA 0.860 0.853 0.868
Table 4: Final Mean Squared Error on EMNIST for Autoen- Accuracy curves. In most experiments, our FedDA fed-
coder erated optimization method is able to achieve the fastest
convergence, especially, FedDA can converge within less
Optimizer SGDM Adam AdaGrad than 200 training rounds and then always keep stable on the
FedLocal 0.0169 0.0289 0.0168 EMNIST dataset. A reasonable explanation is that FedDA
FedAvg 0.0171 0.0171 0.0171 fully utilizes the global momentum on each local iteration as
MFL 0.0168 0.0290 0.0291 well as employ full batch gradients to mimic centralized op-
MimeLite 0.0183 0.0307 0.0287 timization in the end of the training process for accelerating
FedOpt 0.0175 0.0173 0.0145 the training convergence.
FedDA 0.0167 0.0166 0.0132 Final Accuracy on CIFAR-100 and EMNIST. Tables 2-
4 show the quality of ten federated learning algorithms
over CIFAR-100 and EMNIST respectively. Notice that
the performance achieved by five regular federated learn-
Convergence on CIFAR-100 and EMNIST. Figures 1 and ing algorithms, including FedAvg, SCAFFOLD, FedLin,
3 exhibit the Accuracy curves of ten federated learning STEM, and CLIMB, keep unchanged when using different
models for image classification over CIFAR-100 and charac- optimizers, such as SGDM, Adam, and AdaGrad, while
ter recognition on EMNIST respectively. It is obvious that five federated optimization approaches, including MFL, Fe-
the performance curves by federated learning algorithms dOpt, Mime, FedLocal, and our FedDA model obtain dif-
initially keep increasing with training rounds and remains ferent performance. We have observed that our FedDA
relatively stable when the curves are beyond convergence federated optimization solution outperforms the competitor
points, i.e., turning points from a sharp Accuracy increase methods in most experiments. FedDA achieves the highest
to a flat curve. This phenomenon indicates that most feder- Accuracy values (> 0.488 over CIFAR-100 and > 0.853
ated learning algorithms are able to converge to the invariant on EMNIST respectively), which are better than other nine
solutions after enough training rounds. However, among baseline methods in all tests. A reasonable explanation is
five regular federated learning and five federated optimiza- that the combination of decoupling and full batch gradient
tion approaches, our FedDA federated optimization method techniques is able to achieve faster convergence as well as
can significantly speedup the convergence on two datasets higher test accuracy in cross-device settings. In addition,
in most experiments, showing the superior performance of the promising performance of FedDA over both datasets
FedDA in federated settings. Compared to the learning re- implies that FedDA has great potential as a general feder-
sults by other federated learning models, based on training ated optimization solution to learning tasks over federated
rounds at convergence points, FedDA, on average, achieves datasets, which is desirable in practice.
34.3% and 22.6% convergence improvement on two datasets
respectively. Impact of local iteration numbers. Figure 5 shows the im-
pact of the numbers of local iterations in our FedDA model
Loss on CIFAR-100 and EMNIST. Figures 2 and 4 with the adaptive optimizers over the datasets of CIFAR-
present the Loss curves achieved by ten federated learn- 100, EMNIST, and Stack Overflow respectively. The perfor-
ing models on two datasets respectively. We have observed mance curves initially raise when the local iteration number
obvious that the reverse trends, in comparison with the
0.8 0.8
0.5 0.6 0.6
0.7 0.4 0.7
0.5 0.5
Final Accuracy
Final Accuracy
Final Accuracy
Final Accuracy
Final Accuracy
Final Accuracy
0.4 0.6 0.6
0.4 0.3 0.4
0.5 0.5
0.3
0.4 0.3 0.4 0.3
0.2
0.2 0.3 0.3
0.2 0.2
SGDM 0.2 SGDM SGDM 0.1 SGDM 0.2 SGDM SGDM
0.1 0.1 0.1
Adam 0.1 Adam Adam Adam 0.1 Adam Adam
AdaGrad AdaGrad AdaGrad AdaGrad AdaGrad AdaGrad
0 0 0 0 0 0
1 5 10 20 1 5 10 20 1 5 10 20 100 500 1000 1500 2000 10 50 100 500 1000 50 100 200 400 1000
Local Iteration Number Local Iteration Number Local Iteration Number Training Round Training Round Training Round
(a) CIFAR-100 (b) EMNIST (c) Stack Overflow (a) CIFAR-100 (b) EMNIST (c) Stack Overflow
Figure 5: Final Accuracy with Varying #Local Iterations Figure 6: Final Accuracy with Varying Training Rounds
increases and then keep relatively stable or even drop if the in federated settings. First, we establish a connection be-
local iteration number keeps increasing. This demonstrates tween federated optimization methods and decompositions
that there must exist a suitable local iteration number for of ODEs of corresponding centralized optimizers. Second,
the FL training. A too large number may make the clients we develop a momentum decoupling adaptive optimization
overfit to their local datasets, such that the local models are method to well balance fast convergence and high accu-
far away from globally optimal models and the FL training racy in FL. Finally, we utilize full batch gradient to mimic
achieves slower convergence. On the other hand, a too small centralized optimization in the end of the training process.
number may result in slow convergence of local training and
also increase the difficulty of convergence of global training. References
Thus, it is important to choose the appropriate numbers for Acar, D. A. E., Zhao, Y., Matas, R., Mattina, M., What-
well balancing the local training and global training. Notice mough, P., and Saligrama, V. Federated learning based
that the final accuracy of AdaGrad and SGDM is closed to on dynamic regularization. In Int. Conf. on Learning
0 on the EMNIST dataset when the local iteration number Representations (ICLR), pp. 1–36, 2021.
is larger than 10. A reasonable explanation is EMNIST is
a simple dataset and a large local iteration number makes Afonin, A. and Karimireddy, S. P. Towards model agnostic
local models converge to their local minimum, which may federated learning using knowledge distillation. arXiv
be distant from the minimum of global model. This leads to preprint arXiv:2110.15210, 2021.
lower accuracy of global model.
Anonymous. Acceleration of federated learning with allevi-
Impact of training round numbers. Figures 6 (a)-(c) ated forgetting in local training. In Submitted to Int. Conf.
present the performance achieved by our FedDA method on Learning Representations (ICLR), pp. 1–19, 2022a.
with varying the numbers of training rounds from 100 to under review.
2,000, from 10 to 1,000, and from 50 to 1,000 on three
datasets. It is obvious that the performance curves with Anonymous. An agnostic approach to federated learning
each optimizer keep increasing with the increased number with class imbalance. In Submitted to Int. Conf. on Learn-
of training rounds. This phenomenon indicates that the accu- ing Representations (ICLR), pp. 1–12, 2022b. under
racy in the federated settings are sensitive to training rounds. review.
This is because the special data and system heterogeneity Anonymous. AQUILA: Communication efficient federated
issues in the FL increase the difficulty in converging in a learning with adaptive quantization of lazily-aggregated
short time and the FL models need more training rounds to gradients. In Submitted to Int. Conf. on Learning Repre-
obtain the desired learning results. However, as shown in sentations (ICLR), pp. 1–23, 2022c. under review.
the above experiments of convergence in Figures 1-4, our
FedDA method presents superior convergence performance, Anonymous. Hybrid local SGD for federated learning with
compared with other FL algorithms, including both regular heterogeneous communications. In Submitted to Int. Conf.
federated learning and federated optimization approaches. on Learning Representations (ICLR), pp. 1–41, 2022d.
under review.
6. Conclusions Anonymous. Improving federated learning face recognition
This work presents a theoretical principle on where to and via privacy-agnostic clusters. In Submitted to Int. Conf.
how to design and utilize adaptive optimization methods on Learning Representations (ICLR), pp. 1–21, 2022e.
under review.
Anonymous. Privacy-preserving task-agnostic vision trans- Gao, H., Xu, A., and Huang, H. On the convergence of
former for image processing. In Submitted to Int. Conf. communication-efficient local sgd for federated learning.
on Learning Representations (ICLR), pp. 1–28, 2022f. In AAAI Conf. on Artificial Intelligence, volume 35, pp.
under review. 7510–7518, 2021.
Anonymous. Recycling model updates in federated learn- Goswami, S., Pokhrel, A., Lee, K., Liu, L., Zhang, Q., and
ing: Are gradient subspaces low-rank? In Submitted to Zhou, Y. Graphmap: Scalable iterative graph processing
Int. Conf. on Learning Representations (ICLR), pp. 1–71, using nosql. The Journal of Supercomputing (TJSC), 76
2022g. under review. (9):6619–6647, 2020.
Anonymous. Unsupervised federated learning is possible. Guimu Guo, Da Yan, L. Y. J. K. C. L. Z. J. and Zhou, Y.
In Submitted to Int. Conf. on Learning Representations Maximal directed quasi-clique mining. In Proceedings
(ICLR), pp. 1–22, 2022h. under review. of the 38th IEEE International Conference on Data Engi-
neering (ICDE’22), Kuala Lumpur, Malaysia, May 9-12
Balunović, M., Dimitrov, D. I., Staab, R., and Vechev, M.
2022.
Bayesian framework for gradient leakage. arXiv preprint
arXiv:2111.04706, 2021. Hamer, J., Mohri, M., and Suresh, A. T. Fedboost: A
communication-efficient algorithm for federated learning.
Bao, X., Liu, L., Xiao, N., Zhou, Y., and Zhang, Q. Policy-
In Proceedings of the 37th International Conference on
driven autonomic configuration management for nosql. In
Machine Learning, ICML 2020, 13-18 July 2020, Virtual
Proceedings of the 2015 IEEE International Conference
Event, pp. 3973–3983, 2020a.
on Cloud Computing (CLOUD’15), pp. 245–252, New
York, NY, June 27-July 2 2015. Hamer, J., Mohri, M., and Suresh, A. T. FedBoost: A
communication-efficient algorithm for federated learning.
Caldas, S., Konečný, J., McMahan, H. B., and Talwalkar, A.
In Int. Conf. on Machine Learning (ICML), volume 119,
Expanding the reach of federated learning by reducing
pp. 3973–3983, 2020b.
client resource requirements. CoRR, abs/1812.07210,
2018a. He, C., Annavaram, M., and Avestimehr, S. Group knowl-
edge transfer: Federated learning of large cnns at the edge.
Caldas, S., Wu, P., Li, T., Konečný, J., McMahan, H. B.,
In Advances in Neural Information Processing Systems
Smith, V., and Talwalkar, A. LEAF: A benchmark for
33: Annual Conference on Neural Information Process-
federated settings. CoRR, abs/1812.01097, 2018b.
ing Systems 2020, NeurIPS 2020, December 6-12, 2020,
Chen, H.-Y. and Chao, W.-L. On bridging generic virtual, 2020.
and personalized federated learning. arXiv preprint
Hong, J., Wang, H., Wang, Z., and Zhou, J. Federated
arXiv:2107.00778, 2021.
robustness propagation: Sharing adversarial robustness
Chen, M., Poor, H. V., Saad, W., and Cui, S. Convergence in federated learning. arXiv preprint arXiv:2106.10196,
time minimization of federated learning over wireless 2021.
networks. In IEEE Int. Conf. on Communications (ICC),
Hsieh, K., Phanishayee, A., Mutlu, O., and Gibbons, P. B.
pp. 1–6, 2020.
The non-iid data quagmire of decentralized machine learn-
Dai, Z., Low, B. K. H., and Jaillet, P. Federated bayesian ing. In Proceedings of the 37th International Conference
optimization via thompson sampling. In Annual Conf. on on Machine Learning, ICML 2020, 13-18 July 2020, Vir-
Neural Information Processing Systems (NeurIPS), pp. tual Event, volume 119 of Proceedings of Machine Learn-
1–13, 2020. ing Research, pp. 4387–4398. PMLR, 2020.
Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient Hyeon-Woo, N., Ye-Bin, M., and Oh, T.-H. Fedpara: Low-
methods for online learning and stochastic optimization. rank hadamard product for communication-efficient fed-
Journal of machine learning research, 12(7), 2011a. erated learning. arXiv preprint arXiv:2108.06098, 2021.
Duchi, J. C., Hazan, E., and Singer, Y. Adaptive subgradient Ingerman, A. and Ostrowski, K. Introducing tensorflow
methods for online learning and stochastic optimization. federated. https://medium.com/tensorflow/
J. Mach. Learn. Res., 12:2121–2159, 2011b. introducing-tensorflow-federated, 2019.
Fowl, L., Geiping, J., Czaja, W., Goldblum, M., and Gold- Jiang, Y., Perng, C.-S., Sailer, A., Silva-Lepe, I., Zhou, Y.,
stein, T. Robbing the fed: Directly obtaining private and Li, T. Csm: A cloud service marketplace for complex
data in federated learning with modified models. arXiv service acquisition. ACM Transactions on Intelligent
preprint arXiv:2110.13057, 2021. Systems and Technology (TIST), 8(1):1–25, 2016.
Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, Kingma, D. P. and Ba, J. Adam: A method for stochastic
M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, optimization. arXiv preprint arXiv:1412.6980, 2014.
G., Cummings, R., et al. Advances and open problems
in federated learning. arXiv preprint arXiv:1912.04977, Kingma, D. P. and Ba, J. Adam: A method for stochastic
2019. optimization. In Bengio, Y. and LeCun, Y. (eds.), 3rd
International Conference on Learning Representations,
Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Confer-
M., Bhagoji, A. N., Bonawitz, K. A., Charles, Z., Cor- ence Track Proceedings, 2015.
mode, G., Cummings, R., D’Oliveira, R. G. L., Eichner,
H., Rouayheb, S. E., Evans, D., Gardner, J., Garrett, Z., Konečný, J., McMahan, H. B., Ramage, D., and Richtárik, P.
Gascón, A., Ghazi, B., Gibbons, P. B., Gruteser, M., Har- Federated optimization: Distributed machine learning for
chaoui, Z., He, C., He, L., Huo, Z., Hutchinson, B., Hsu, on-device intelligence. CoRR, abs/1610.02527, 2016a.
J., Jaggi, M., Javidi, T., Joshi, G., Khodak, M., Konečný, Konečný, J., McMahan, H. B., Yu, F. X., Richtárik, P.,
J., Korolova, A., Koushanfar, F., Koyejo, S., Lepoint, T., Suresh, A. T., and Bacon, D. Federated learning: Strate-
Liu, Y., Mittal, P., Mohri, M., Nock, R., Özgür, A., Pagh, gies for improving communication efficiency. In NIPS
R., Qi, H., Ramage, D., Raskar, R., Raykova, M., Song, Workshop on Private Multi-Party Machine Learning,
D., Song, W., Stich, S. U., Sun, Z., Suresh, A. T., Tramèr, 2016b.
F., Vepakomma, P., Wang, J., Xiong, L., Xu, Z., Yang, Q.,
Yu, F. X., Yu, H., and Zhao, S. Advances and open prob- Krizhevsky, A. Learning multiple layers of features from
lems in federated learning. Found. Trends Mach. Learn., tiny images. Technical Report, 2009.
14(1-2):1–210, 2021.
Lee, K., Liu, L., Tang, Y., Zhang, Q., and Zhou, Y. Efficient
Karimireddy, S. P., He, L., and Jaggi, M. Byzantine-robust and customizable data partitioning framework for dis-
learning on heterogeneous datasets via bucketing. arXiv tributed big rdf data processing in the cloud. In Proceed-
preprint arXiv:2006.09365, 2020a. ings of the 2013 IEEE International Conference on Cloud
Computing (CLOUD’13), pp. 327–334, Santa Clara, CA,
Karimireddy, S. P., Jaggi, M., Kale, S., Mohri, M., Reddi, June 27-July 2 2013.
S. J., Stich, S. U., and Suresh, A. T. Mime: Mimicking
centralized stochastic algorithms in federated learning. Lee, K., Liu, L., Schwan, K., Pu, C., Zhang, Q., Zhou, Y.,
arXiv preprint arXiv:2008.03606, 2020b. Yigitoglu, E., and Yuan, P. Scaling iterative graph compu-
tations with graphmap. In Proceedings of the 27th IEEE
Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S., Stich, international conference for High Performance Comput-
S., and Suresh, A. T. Scaffold: Stochastic controlled ing, Networking, Storage and Analysis (SC’15), pp. 57:1–
averaging for federated learning. In Int. Conf. on Machine 57:12, Austin, TX, November 15-20 2015.
Learning (ICML), pp. 5132–=–5143, 2020c.
Lee, K., Liu, L., Ganti, R. L., Srivatsa, M., Zhang, Q., Zhou,
Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S. J., Stich, Y., and Wang, Q. Lightwieight indexing and querying ser-
S. U., and Suresh, A. T. SCAFFOLD: stochastic con- vices for big spatial data. IEEE Transactions on Services
trolled averaging for federated learning. In Proceedings Computing (TSC), 12(3):343–355, 2019.
of the 37th International Conference on Machine Learn-
ing, ICML 2020, 13-18 July 2020, Virtual Event, volume Leroy, D., Coucke, A., Lavril, T., Gisselbrecht, T., and
119 of Proceedings of Machine Learning Research, pp. Dureau, J. Federated learning for keyword spotting. In
5132–5143. PMLR, 2020d. IEEE Int. Conf. on Acoustics, Speech and Signal Process-
ing (ICASSP), pp. 6341–6345, 2019.
Karimireddy, S. P., Jaggi, M., Kale, S., Mohri, M., Reddi,
S. J., Stich, S. U., and Suresh, A. T. Mime: Mimicking Li, W. and McCallum, A. Pachinko allocation: Dag-
centralized stochastic algorithms in federated learning. structured mixture models of topic correlations. In Cohen,
In 9th International Conference on Learning Representa- W. W. and Moore, A. W. (eds.), Machine Learning, Pro-
tions, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, ceedings of the Twenty-Third International Conference
2021. (ICML 2006), Pittsburgh, Pennsylvania, USA, June 25-
29, 2006, volume 148 of ACM International Conference
Khanduri, P., Sharma, P., Yang, H., Hong, M., Liu, J., Proceeding Series, pp. 577–584. ACM, 2006.
Rajawat, K., and Varshney, P. K. Stem: A stochastic
two-sided momentum algorithm achieving near-optimal Li, Z., Kovalev, D., Qian, X., and Richtarik, P. Acceleration
sample and communication complexities for federated for compressed gradient descent in distributed and fed-
learning. In Annual Conf. on Neural Information Process- erated optimization. In Int. Conf. on Machine Learning
ing Systems (NeurIPS), pp. 1–12, 2021. (ICML), volume 119, pp. 5895–5904, 2020.
Liu, J., Huang, J., Zhou, Y., Li, X., Ji, S., Xiong, H., and Mitra, A., Jaafar, R., Pappas, G., and Hassani, H. Linear
Dou, D. From distributed machine learning to federated convergence in federated learning: Tackling client hetero-
learning: A survey. arXiv preprint arXiv:2104.14362, geneity and sparse gradients. In The 35th Conference on
2021. Neural Information Processing Systems, (NeurIPS’21),
Online, December 6-14 2021a.
Liu, J., Huang, J., Zhou, Y., Li, X., Ji, S., Xiong, H., and
Dou, D. From distributed machine learning to federated Mitra, A., Jaafar, R., Pappas, G. J., and Hassani, H. Lin-
learning: A survey. Knowledge and Information Systems ear convergence in federated learning: Tackling client
(KAIS), 64(4):885–917, 2022. heterogeneity and sparse gradients. In Annual Conf. on
Neural Information Processing Systems (NeurIPS), pp.
Liu, W., Chen, L., Chen, Y., and Zhang, W. Accelerat- 1–14, 2021b.
ing federated learning via momentum gradient descent.
IEEE Trans. Parallel Distributed Syst., 31(8):1754–1766, Oh, J., Kim, S., and Yun, S.-Y. Fedbabu: Towards enhanced
2020a. representation for federated image classification. arXiv
preprint arXiv:2106.06042, 2021.
Liu, W., Chen, L., Chen, Y., and Zhang, W. Accelerating
federated learning via momentum gradient descent. IEEE Ozfatura, E., Ozfatura, K., and Gündüz, D. Fedadc: Accel-
Transactions on Parallel and Distributed Systems (TPDS), erated federated learning with drift control. In IEEE Int.
31(8):1754–1766, 2020b. Symposium on Information Theory (ISIT), pp. 467–472,
2021.
Liu, Y., Kang, Y., Zhang, X., Li, L., Cheng, Y., Chen, T.,
Hong, M., and Yang, Q. A communication efficient verti- Palanisamy, B., Liu, L., Lee, K., Meng, S., Tang, Y., and
cal federated learning framework. CoRR, abs/1912.11187, Zhou, Y. Anonymizing continuous queries with delay-
2019. tolerant mix-zones over road networks. Distributed and
Parallel Databases (DAPD), 32(1):91–118, 2014.
McMahan, B., Moore, E., Ramage, D., Hampson, S., and
y Arcas, B. A. Communication-efficient learning of deep Palanisamy, B., Liu, L., Zhou, Y., and Wang, Q. Privacy-
networks from decentralized data. In Singh, A. and Zhu, preserving publishing of multilevel utility-controlled
X. J. (eds.), Proceedings of the 20th International Con- graph datasets. ACM Transactions on Internet Technology
ference on Artificial Intelligence and Statistics, AISTATS (TOIT), 18(2):24:1–24:21, 2018.
2017, 20-22 April 2017, Fort Lauderdale, FL, USA, vol- Pathak, R. and Wainwright, M. J. Fedsplit: an algorith-
ume 54 of Proceedings of Machine Learning Research, mic framework for fast federated optimization. In An-
pp. 1273–1282. PMLR, 2017a. nual Conf. on Neural Information Processing Systems
(NeurIPS), volume 33, pp. 7057–7066, 2020.
McMahan, B., Moore, E., Ramage, D., Hampson, S., and
y Arcas, B. A. Communication-efficient learning of deep Qian, N. On the momentum term in gradient descent learn-
networks from decentralized data. In Artificial Intelli- ing algorithms. Neural Networks, 12(1):145–151, 1999.
gence and Statistics, pp. 1273–1282, 2017b.
Reddi, S., Zaheer, M., Sachan, D., Kale, S., and Kumar,
McMahan, H. B. and Streeter, M. J. Adaptive bound opti- S. Adaptive methods for nonconvex optimization. In
mization for online convex optimization. In Kalai, A. T. Annual Conf. on Neural Information Processing Systems
and Mohri, M. (eds.), COLT 2010 - The 23rd Conference (NeurIPS), pp. 1–17, 2018.
on Learning Theory, Haifa, Israel, June 27-29, 2010, pp.
244–256. Omnipress, 2010. Reddi, S. J., Charles, Z., Zaheer, M., Garrett, Z., Rush, K.,
Konečný, J., Kumar, S., and McMahan, H. B. Adaptive
McMahan, H. B., Moore, E., Ramage, D., and y Arcas, federated optimization. In 9th International Conference
B. A. Federated learning of deep networks using model on Learning Representations, ICLR 2021, Virtual Event,
averaging. CoRR, abs/1602.05629, 2016. Austria, May 3-7, 2021, 2021a.
Mills, J., Hu, J., and Min, G. Communication-efficient Reddi, S. J., Charles, Z., Zaheer, M., Garrett, Z., Rush, K.,
federated learning for wireless edge intelligence in iot. Konečný, J., Kumar, S., and McMahan, H. B. Adap-
IEEE Internet of Things Journal, 7(7):5986–5994, 2019. tive federated optimization. In Int. Conf. on Learning
Representations (ICLR), 2021b.
Mills, J., Hu, J., Min, G., Jin, R., Zheng, S., and Wang,
J. Accelerating federated learning with a global biased Rothchild, D., Panda, A., Ullah, E., Ivkin, N., Stoica, I.,
optimiser. arXiv preprint arXiv:2108.09134, 2021. Braverman, V., Gonzalez, J., and Arora, R. Fetchsgd:
Communication-efficient federated learning with sketch- Wang, J., Xu, Z., Garrett, Z., Charles, Z., Liu, L., and
ing. In Int. Conf. on Machine Learning (ICML), pp. 8253– Joshi, G. Local adaptivity in federated learning: Conver-
8265, 2020. gence and consistency. arXiv preprint arXiv:2106.02305,
2021c.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learn-
ing internal representations by error propagation. In Woodworth, B. E., Patel, K. K., and Srebro, N. Minibatch
Rumelhart, D. E. and McClelland, J. L. (eds.), Parallel vs local SGD for heterogeneous distributed learning. In
Distributed Processing. MIT Press, 1986. Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and
Lin, H. (eds.), Advances in Neural Information Process-
Sharma, A., Chen, W., Zhao, J., Qiu, Q., Chaterji, S., and ing Systems 33: Annual Conference on Neural Informa-
Bagchi, S. Tesseract: Gradient flip score to secure fed- tion Processing Systems 2020, NeurIPS 2020, December
erated learning against model poisoning attacks. arXiv 6-12, 2020, virtual, 2020.
preprint arXiv:2110.10108, 2021.
Wu, H. and Wang, P. Fast-convergent federated learning
Su, Z., Liu, L., Li, M., Fan, X., and Zhou, Y. Servicetrust: with adaptive weighting. IEEE Transactions on Cognitive
Trust management in service provision networks. In Communications and Networking, 2021.
Proceedings of the 10th IEEE International Conference
on Services Computing (SCC’13), pp. 272–279, Santa Wu, W., He, L., Lin, W., and Mao, R. Accelerating federated
Clara, CA, June 27-July 2 2013. learning over reliability-agnostic clients in mobile edge
computing systems. IEEE Transactions on Parallel and
Su, Z., Liu, L., Li, M., Fan, X., and Zhou, Y. Reliable and Distributed Systems (TPDS), 32(7):1539–1551, 2020.
resilient trust management in distributed service provision
networks. ACM Transactions on the Web (TWEB), 9(3): Wu, Y. and He, K. Group normalization. Int. J. Comput.
1–37, 2015. Vis., 128(3):742–755, 2020.
Sutskever, I., Martens, J., Dahl, G. E., and Hinton, G. E. On Xia, S., Zhu, J., Yang, Y., Zhou, Y., Shi, Y., and Chen, W.
the importance of initialization and momentum in deep Fast convergence algorithm for analog federated learning.
learning. In Proceedings of the 30th International Con- In IEEE Int. Conf. on Communications (ICC), pp. 1–6,
ference on Machine Learning, ICML 2013, Atlanta, GA, 2021.
USA, 16-21 June 2013, volume 28 of JMLR Workshop
and Conference Proceedings, pp. 1139–1147, 2013. Xie, C., Koyejo, O., Gupta, I., and Lin, H. Local adaalter:
Communication-efficient stochastic gradient descent with
TensorFlow. Tensorflow federated stack overflow dataset. adaptive learning rates. CoRR, abs/1911.09030, 2019.
https://www.tensorflow.org/federated/
api_docs/python/tff/simulation/ Yan, D., Qu, W., Guo, G., Wang, X., and Zhou, Y. Prefixfpm:
datasets/stackoverflow/load_data, 2019. A parallel framework for general-purpose mining of fre-
quent and closed patterns. The VLDB Journal (VLDBJ),
Triastcyn, A., Reisser, M., and Louizos, C. Dp-rec: Pri- 31(2):253–286, 2022a.
vate & communication-efficient federated learning. arXiv
preprint arXiv:2111.05454, 2021. Yan, D., Zhou, Y., Guo, G., and Liu, H. Parallel graph
processing. In Sherif Sakr, A. Y. Z. and Taheri, J. (eds.),
Wang, H.-P., Stich, S. U., He, Y., and Fritz, M. Progfed: Encyclopedia of Big Data Technologies. Springer, 2022b.
Effective, communication, and computation efficient fed-
erated learning by progressive training. arXiv preprint Yang, H. H-fl: A hierarchical communication-efficient and
arXiv:2110.05323, 2021a. privacy-protected architecture for federated learning. In
Int. Joint Conf. on Artificial Intelligence (IJCAI), pp. 479–
Wang, J., Tantia, V., Ballas, N., and Rabbat, M. Slowmo: 485, 2021.
Improving communication-efficient distributed sgd with
slow momentum. In Int. Conf. on Learning Representa- Yapp, A. Z. H., Koh, H. S. N., Lai, Y. T., Kang, J., Li, X., Ng,
tions (ICLR), pp. 1–27, 2020. J. S., Jiang, H., Lim, W. Y. B., Xiong, Z., and Niyato1,
D. Communication-efficient and scalable decentralized
Wang, J., Xu, Z., Garrett, Z., Charles, Z., Liu, L., and Joshi, federated edge learning. In Int. Joint Conf. on Artificial
G. Local adaptivity in federated learning: Convergence Intelligence (IJCAI), pp. 5032–5035, 2021.
and consistency. In The International Workshop on Feder-
ated Learning for User Privacy and Data Confidentiality Yuan, H., Morningstar, W., Ning, L., and Singhal, K. What
in Conjunction with ICML 2021, (FL-ICML’21), Online, do we mean by generalization in federated learning?
December 6-14 2021b. arXiv preprint arXiv:2110.14216, 2021a.
Yuan, H., Zaheer, M., and Reddi, S. Federated composite Zhou, Y. and Liu, L. Approximate deep network embedding
optimization. In Int. Conf. on Machine Learning (ICML), for mining large-scale graphs. In Proceedings of the 2019
volume 139, pp. 12253–12266, 2021b. IEEE International Conference on Cognitive Machine
Intelligence (CogMI’19), pp. 53–60, Los Angeles, CA,
Yun, C., Rajput, S., and Sra, S. Minibatch vs local sgd with
December 12-14 2019.
shuffling: Tight convergence bounds and beyond. arXiv
preprint arXiv:2110.10342, 2021. Zhou, Y., Liu, L., Lee, K., Pu, C., and Zhang, Q. Fast itera-
Zhang, H., Liu, J., Jia, J., Zhou, Y., Dai, H., and Dou, tive graph computation with resource aware graph parallel
D. Fedduap: Federated learning with dynamic update abstractions. In Proceedings of the 24th ACM Symposium
and adaptive pruning using shared data on the server. In on High-Performance Parallel and Distributed Comput-
Proceedings of the 31st International Joint Conference on ing (HPDC’15), pp. 179–190, Portland, OR, June 15-19
Artificial Intelligence (IJCAI’22), Messe Wien, Vienna, 2015a.
Austria, July 23-29 2022. Zhou, Y., Liu, L., Lee, K., and Zhang, Q. Graphtwist: Fast
Zhang, J., Karimireddy, S. P., Veit, A., Kim, S., Reddi, iterative graph computation with two-tier optimizations.
S. J., Kumar, S., and Sra, S. Why ADAM beats SGD for Proceedings of the VLDB Endowment (PVLDB), 8(11):
attention models. CoRR, abs/1912.03194, 2019. 1262–1273, 2015b.
Zhang, M., Sapra, K., Fidler, S., Yeung, S., and Alvarez, Zhou, Y., Amimeur, A., Jiang, C., Dou, D., Jin, R., and
J. M. Personalized federated learning with first order Wang, P. Density-aware local siamese autoencoder net-
model optimization. In Int. Conf. on Learning Represen- work embedding with autoencoder graph clustering. In
tations (ICLR), 2021. Proceedings of the 2018 IEEE International Conference
on Big Data (BigData’18), pp. 1162–1167, Seattle, WA,
Zhang, Q., Liu, L., Ren, Y., Lee, K., Tang, Y., Zhao, X., and December 10-13 2018a.
Zhou, Y. Residency aware inter-vm communication in vir-
tualized cloud: Performance measurement and analysis. Zhou, Y., Wu, S., Jiang, C., Zhang, Z., Dou, D., Jin, R.,
In Proceedings of the 2013 IEEE International Confer- and Wang, P. Density-adaptive local edge representation
ence on Cloud Computing (CLOUD’13), pp. 204–211, learning with generative adversarial network multi-label
Santa Clara, CA, June 27-July 2 2013. edge classification. In Proceedings of the 18th IEEE
International Conference on Data Mining (ICDM’18),
Zhang, Q., Liu, L., Lee, K., Zhou, Y., Singh, A., Mandagere,
pp. 1464–1469, Singapore, November 17-20 2018b.
N., Gopisetty, S., and Alatorre, G. Improving hadoop
service provisioning in a geographically distributed cloud. Zhou, Y., Ye, Q., and Lv, J. Communication-efficient fed-
In Proceedings of the 2014 IEEE International Confer- erated learning with compensated overlap-fedavg. IEEE
ence on Cloud Computing (CLOUD’14), pp. 432–439, Transactions on Parallel and Distributed Systems (TPDS),
Anchorage, AK, June 27-July 2 2014. 33(1):192–205, 2022c.
Zhang, Z., Jin, J., Zhang, Z., Zhou, Y., Zhao, X., Ren, J.,
Liu, J., Wu, L., Jin, R., and Dou, D. Validating the
lottery ticket hypothesis with inertial manifold theory. In
Advances in Neural Information Processing Systems 34:
Annual Conference on Neural Information Processing
Systems 2021 (NeurIPS’21), Virtual.
Zhou, C., Liu, J., Jia, J., Zhou, J., Zhou, Y., Dai, H., and Dou,
D. Efficient device scheduling with multi-job federated
learning. In Proceedings of the 36th AAAI Conference
on Artificial Intelligence (AAAI’22), Vancouver, Canada,
February 22-March 1 2022a.
Zhou, C., Liu, J., Jia, J., Zhou, J., Zhou, Y., Dai, H., and Dou,
D. Efficient device scheduling with multi-job federated
learning. In AAAI Conf. on Artificial Intelligence, pp.
1–14, 2022b.
Zhou, Y. Innovative Mining, Processing, and Application of
Big Graphs. PhD thesis, Georgia Institute of Technology,
Atlanta, GA, USA, 2017.
A. Appendix
A.1. Related Work
Parallel, distributed, and federated learning have attracted active research in the last decade (Zhang et al., 2013; Lee et al.,
2013; Su et al., 2013; Palanisamy et al., 2014; Zhang et al., 2014; Su et al., 2015; Zhou et al., 2015a; Bao et al., 2015; Zhou
et al., 2015b; Lee et al., 2015; Jiang et al., 2016; Zhou, 2017; Zhou et al., 2018b;a; Palanisamy et al., 2018; Lee et al., 2019;
Zhou & Liu, 2019; Goswami et al., 2020; Zhang et al.; Zhou et al., 2022a; Yan et al., 2022b;a; Liu et al., 2022; Guimu Guo
& Zhou, 2022; Zhang et al., 2022). While it achieves much advancement in federated learning, FedAvg (McMahan et al.,
2017b) does not consider the adaptive optimization when aggregating the weights or gradients from devices, which makes
the training process inefficient and makes it difficult to tune or to achieve desired accuracy. SCAFFOLD (Karimireddy et al.,
2020c) exploits control variate to reduce the client drift problem, i.e., “over-fitting” to local device data, in order to reduce
the influence of the heterogeneity of non-IID data and to improve the performance (accuracy). The control parameters may
result in stateful devices (Karimireddy et al., 2020c; Acar et al., 2021), which is not compatible with cross-device Federated
Learning (FL) because of full participation of devices. The cross-device FL simultaneously deals with large amounts of
devices while each round only samples a fraction of the available devices for the training process in order to mitigate the
straggler problem (Kairouz et al., 2019; Liu et al., 2021). Although the contribution of the devices can be exploited to
dynamically update the global model (Wu & Wang, 2021), the contribution-based dynamic update cannot well address the
client-drift problem. The contribution can calculated based on the angle between the local gradient and the global gradient.
Adaptive optimization can be applied either at the server side to improve the performance (Reddi et al., 2021b) and to reduce
the communication and computation costs (Mills et al., 2021), using multiple adaptive optimization method, e.g., AdaGrad
(Duchi et al., 2011a), Yogi (Reddi et al., 2018), Adam (Kingma & Ba, 2014; Mills et al., 2019) and momentum (Rothchild
et al., 2020; Mills et al., 2021), or at the client side (Yuan et al., 2021b) to reduce the number of rounds (Mills et al., 2019)
utilizing momentum (Liu et al., 2020b; Gao et al., 2021; Wang et al., 2020), Adam (Mills et al., 2019). In contrast, the
single side application of existing adaptive optimization, i.e., the server side or the device side, may lead to insufficient
performance, due to incomplete adaptation of the adaptive optimization methods. While correction techniques can be
utilized to mitigate the bias of the convergence point for the application of AdaGrad at the device side (Wang et al., 2021c),
the application at both the server side and the device should be addressed at the same time to achieve better performance.
In addition, the application of momentum at the device side (Liu et al., 2020b) may incur severe client-drift because of
heterogeneity of non-IID data and law participation rate of devices. Moreover, the synchronization of the global momentum
parameters may incur extra communication costs (Wang et al., 2020). Momentum (Karimireddy et al., 2020b; Ozfatura et al.,
2021; Khanduri et al., 2021) and Adam (Leroy et al., 2019) can be utilized on both the server and the devices to achieve fast
training speed and high convergence accuracy, while the simple application may correspond to limited improvement.
The adaptive optimization methods can be combined with compression mechanism (Mills et al., 2019; Gao et al., 2021;
Rothchild et al., 2020; Li et al., 2020) to further reduce the communication costs. In addition, device selection methods
are utilized to achieve faster convergence (Xia et al., 2021) and higher accuracy (Chen et al., 2020), even with multiple
jobs (Zhou et al., 2022b). Furthermore, overlapping the local training process and the data transfer can improve the
communication efficiency (Zhou et al., 2022c). Sparsification (Mitra et al., 2021b) and quantization (Anonymous, 2022c)
can be exploited in FL to improve the communication efficiency, wherein the update at the device side relies on the gradients
of the last round of all the devices and more communication is required to configure the sparsification parameters (Mitra
et al., 2021b). Moreover, the combination of gossip protocol-based decentralized model aggregation, which is performed at
devices, and the centralized model aggregation, which is performed at the server, helps to speed up the training process of
FL (Anonymous, 2022d). Leading principal components (Anonymous, 2022g), low-rank hadamard product (Hyeon-Woo
et al., 2021), and progressive training (Wang et al., 2021a), i.e., the model progressively grows along with the training
process, can be exploited as well to reduce the communication costs and improve the performance. Some other methods,
e.g., communication-efficient ensemble algorithms (Hamer et al., 2020b), hierarchical architecture (Yang, 2021; Wu et al.,
2020), blockchain-based mechanism (Yapp et al., 2021), federated optimization (Pathak & Wainwright, 2020; Dai et al.,
2020), dual averaging procedure for non-smooth regularizer (Yuan et al., 2021b), agnostic constrained learning formulation
for class imbalance problems (Anonymous, 2022b), separation of unseen client data and unseen client distributions (Yuan
et al., 2021a), knowledge distillation for heterogeneous model structures (Afonin & Karimireddy, 2021), minibatch and local
random reshuffling (Yun et al., 2021), unlabeled data transformation for unsupervised LF (Anonymous, 2022h), multi-task
FL for image processing (Anonymous, 2022f), and personalization (Zhang et al., 2021; Oh et al., 2021; Hyeon-Woo et al.,
2021) or the trade-off between personalization and generalization (Chen & Chao, 2021), are proposed to further improve the
performance of FL. However, the above works are orthogonal to our approach and out of the scope of this paper.
As private data may be recovered with small malicious modifications of the shared model (Fowl et al., 2021) or sensitive
information may still be leaked from the gradients (Balunović et al., 2021), differential privacy (Anonymous, 2022e;
Triastcyn et al., 2021) and compression techniques (Triastcyn et al., 2021) are combined to improve the privacy and
communication efficiency of FL, while the encoding of gradient knowledge with the regularization of locally trained
parameters (Anonymous, 2022a) helps improve the performance and robustness at the same time. While gradient flips
are indicative of attacks, reputation-based weighted aggregation can help to improve the robustness of FL (Sharma et al.,
2021). Batch-normalization (Hong et al., 2021) and Bayes optimal method (Balunović et al., 2021) can be exploited to deal
with adversarial training attack, while bucketing methods can alleviate the impact of byzantine attacks (Karimireddy et al.,
2020a). These methods can be combined with our approach to improve the robustness and the privacy of FL.
This work presents a theoretical principle on where to and how to design and utilize adaptive optimization methods in
federated settings. We theoretically analyze the connection between federated optimization methods and decompositions
of ODEs of corresponding centralized optimizers. We develop a momentum decoupling adaptive optimization method to
make full use of the strength of fast convergence and high accuracy of the global momentum. We also utilize the full batch
gradients to mimic centralized optimization for ensuring the convergence and overcoming the possible inconsistency caused
by adaptive optimization methods.
A.2. Momentum Decoupling Adaptive Optimization: FedDA+Adam and FedDA+AdaGrad

Recall the centralized Adam optimizer.
m(t + 1) = β1 ∗ m(t) + (1 − β1 ) ∗ g(W (t)),

v(t + 1) = β2 ∗ v(t) + (1 − β2 ) ∗ (g(W (t))2 ,
m̂(t + 1) = m(t + 1)/(1 − β1t ), (30)
v̂(t + 1) = v(t + 1)/(1 − β2t ),
p
W (t + 1) = W (t) − m̂(t + 1)/( v̂(t + 1) + ) ∗ η.
Ni i
PM
where g(W ) = i=1 N g (W ). The Adam optimizer is the numerical solution to the ODE system
d
η m(τ ) = −(1 − β1 )m(τ ) + (1 − β1 ) ∗ g(W (τ )),
dτ
d
η v(τ ) = −(1 − β2 )v(τ ) + (1 − β2 ) ∗ (g(W (τ ))2 ,
dτ
m̂(τ ) = m(τ )/(1 − β1τ ), (31)
v̂(τ ) = v(τ )/(1 − β2τ ),
d p
W (τ ) = −m̂(τ )/( v̂(τ ) + ) ∗ η.
dτ
The equilibria of the above is system are (m, v, W ) = (0, 0, W ∗ ), where W ∗ is local minimum point of the loss function,
i.e., g(W ∗ ) = 0. Therefore, the Adam optimizer trains the model to converge to local minimum point of the loss function.
Applying our decoupling method, we first decouple the global momentum and velocity m, v with local training, i.e.,
d i
W (τ ) = −g i (W i (τ )),
dτ
d
η m(τ ) = −(1 − β1 )m(τ ) + (1 − β1 ) ∗ g(W (τ )), (32)
dτ
d
η v(τ ) = −(1 − β2 )v(τ ) + (1 − β2 ) ∗ (g(W (τ ))2 .
dτ
Similar to the SGDM case, the equation of m(τ ) is totally linear and can be decomposed precisely as
d i
η m (τ ) = −(1 − β1 )mi (τ ) + (1 − β1 )g i (W i (τ )). (33)
dτ
Unlike the equation of m which is linear, the equation of v is nonlinear due to the presence of the nonlinear term
PM N i i 2
(g(W ))2 = i=1 N g (W ) . It is apparent that
M M
X Ni 2 X Ni i
g i (W ) 6= (g (W ))2 .
i=1
N i=1
N
PM i
Therefore, it is not possible to decompose the equation of v precisely. Moreover, i=1 NN (g i (W ))2 is not even a close
PM N i i 2
approximation of i=1 N g (W ) . An immediate counter example is that (1 − 1)2 6= 12 + (−1)2 . For centralized Adam,
√
the expectation of each component of m̂/ v̂ is approximately ±1 , which is a crucial point of Adam optimizer. Thus, when
attempting to generalize centralized Adam optimizer to FL, the key is that m, v must be highly synchronized as in centralized
Adam. There are two possible resolutions. First, as in previous arguments of precise decomposition, by reducing local
iteration number to 1, one obtains a precise decomposition of Eq.(31). Consequently, the FL training matches centralized
training. However, the communication cost is too expensive to be affordable for this approach. Another possible way is
to utilize Adam optimizer on server, i.e., local models are trained with SGDM and then server aggregates total gradients
uploaded by local devices and input the aggregation into a global Adam. This is the same as FedOpt method. However,
on second thought, though v cannot be precisely decomposed, the global momentum m can be decomposed precisely as
mentioned before and it still can be utilized fully. Therefore, we propose our FedDA+Adam method as follows. At round E,
pick participating clients S 1 , · · · , S k . For each S i , i = 1, · · · , K, initialize P i = 0, W i (0) = W (E), mi (0) = m(E). For
t = 0, T − 1,
W i (t + 1) = W i (t) − g i (W i (t)) ∗ η
mi (t + 1) = β1 mi (t) + (1 − β1 )g i (W i (t)) (34)
i i i
P = P + m (t + 1)
The global update rule is:
P = aggregation of P i ,
m(E + 1) = aggregation of mi (T )

G = P − β1 ∗ m(E) /(1 − β1 )
m̂(E + 1) = β1 ∗ m(E) + (1 − β1 ) ∗ G /(1 − β1E )

(35)
V (E + 1) = β2 ∗ V (E) + (1 − β2 ) ∗ G 2
V̂ (E + 1) = V (E)/(1 − β2E )
q
W (E + 1) = W (E) − m̂(E + 1)/( V̂ (E + 1) + ) ∗ η ∗ α
with a global gradient G, i.e., P = β1 + (1 − β1 ) ∗ G.

The idea is to treat P as the momentum of the global Adam updated
Simple algebra gives the global gradient G = P − β1 ∗ m(E) /(1 − β1 ). Then the global velocity v is updated with G
followed by the update of W .
Recall the AdaGrad optimizer
V (t) = V (t − 1) + g(W (t))2 ,
p (36)
W (t + 1) = W (t) − g(W (t))/ V (t) + ∗ η.
The corresponding ODE system is
d
η∗ V (τ ) = g(W (τ ))2 ,
dτ (37)
d p
W (τ ) = −g(W (τ ))/ V (τ ) + .
dτ
PM N i i 2
For the same reason, i.e., the nonlinearity of i=1 N g (W ) , AdaGrad optimizer is not suitable for decomposing onto
local devices as well. As in FedDA+Adam, we propose to utilize the decoupled global momentum to generate a global
gradient G and update the global AdaGrad optimizer with input G.
W i (t + 1) = W i (t) − g i (W i (t)) ∗ η
mi (t + 1) = β1 mi (t) + (1 − β1 )g i (W i (t)) (38)
i i i
P = P + m (t + 1)
The global update rule is:
P = aggregation of P i ,
m(E + 1) = aggregation of mi (T )

G = P − β1 ∗ m(E) /(1 − β1 ) (39)
2
V (E + 1) = V (E) + G
q
W (E + 1) = W (E) − G/( V̂ (E + 1) + ) ∗ η
By assembling all the pieces in Sections 4 and A.2, we provide the pseudo code of our FedDA algorithm with the adaptive
optimization of SGDM, Adam, and AdaGrad in Algorithms 1 and 2 respectively.
Algorithm 1 FedDA+SGDM
Input: W0 , m0
1: for r = 0, · · · , R − 1 do
2: Sample a subset S of clients
3: for each client S i ∈ S in parallel do
i i
4: Pr,0 = 0, Wr,0 = Wr , mir,0 = mr
5: for t = 0, · · · , T − 1 do
i i i i
6: Wr,t+1 = Wr,t − gr,t (Wr,t )∗η
7: mir,t+1 = β ∗ mir,t + (1 − β) ∗ gr,t i i
(Wr,t )
i i i
8: Pr,t+1 = Pr,t + mr,t+1
i
9: Pr = aggregation of Pr,T
10: mr+1 = aggregation of mir,T
11: Wr+1 = Wr − Pr ∗ η ∗ α
Algorithm 2 FedDA+ADAM& FedDA+AdaGrad

Input: W0 , m0 , V0
1: for r = 0, · · · , R − 1 do
2: Sample a subset S of clients
3: for each client S i ∈ S in parallel do
i i
4: Pr,0 = 0, Wr,0 = Wr , mir,0 = mr
5: for t = 0, · · · , T − 1 do
i i i i
6: Wr,t+1 = Wr,t − gr,t (Wr,t )∗η
i i i i
7: mr,t+1 = β1 ∗ mr,t + (1 − β1 ) ∗ gr,t (Wr,t )
i i
8: Pr,t+1 = Pr,t + mir,t+1
i
9: Pr = aggregation of Pr,T
10: mr+1 = aggregation of mir,T
11: Gr = (Pr − β1 ∗ mr )/(1 − β1 )
12: ADAM
13: m̂r+1 = (β1 ∗ mr + (1 − β1 ) ∗ Gr )/(1 − β1r )
14: Vr+1 = (β2 ∗ Vr + (1 − β2 ) ∗ Gr2 )
15: V̂r+1 = Vr+1 /(1 − β2r ) √
16: Wr+1 = Wr − m̂r+1 /( Vr+1 + ) ∗ η ∗ α
17: AdaGrad
18: Vr+1 = Vr + Gr2
√
19: Wr+1 = Wr − Gr /( Vr+1 + ) ∗ η ∗ α
A.3. Advantage of Global Momentum in FedDA

Assume that
1. (Lipschitz Gradient). There exists a constant Lg such that kg i (W1 ) − g i (W2 )k ≤ Lg kW1 − W2 k for any W1 , W2 and
i = 1, · · · , m.
2. (Bounded Gradient) kg i kL∞ < ∞ for any i = 1, 2, · · · , m.
Under these assumptions, we theoretically demonstrate that (1) local momentum deviates from the centralized one at
exponential rate O(eλt ); and (2) global momentum in FedDA deviates from the centralized one at algebraic rate O(t2 ).
Local momentum deviates from the centralized one at exponential rate. Let (mi (τ ), W i (τ )) be the solution to the
decomposed system
d i
η m (τ ) = −(1 − β)mi (τ ) + (1 − β)g i (W i (τ )),
dτ (40)
d i
W (τ ) = −mi (τ ).
dτ
Let (m(τ ), W (τ )) be the solution to Eq.(19) with (m(0), W (0)) = (mi (0), W i (0)), i.e., centralized SGDm optimizer with
the same initialization as the decomposed optimizer. By direct calculations, we have
d
(mi (τ ) − m(τ )) = (1 − β) − (mi (τ ) − m(τ )) + g i (W i (τ )) − g i (W (τ )) + R(τ ) ,

η
dτ (41)
d
(W i (τ ) − W (τ )) = −(mi (τ ) − m(τ )),
dτ
N x
where R(τ ) = j6=i Nj g j (W (τ )) − g i (W (τ )) . One can check that ∇kxk = kxk for nay x ∈ Rn . Therefore, it holds
P
that k∇kxkk ≤ 1 and x · ∇kxk = kxk, where · means dot product. Taking the inner product of the first equation in Eq.(41)
with ∇kmi (τ ) − m(τ )k, we obtain
d
η kmi (τ ) − m(τ )k ≤ (1 − β) − kmi (τ ) − m(τ )k + Lg kW i (τ ) − W (τ )k + kR(τ ) . (42)
dτ
Similarly, take the inner produce of the second equation in Eq.(41) with ∇kW i − W k, we obtain
d
kW i (τ ) − W (τ )k ≤ kmi (τ ) − m(τ )k (43)
dτ
The system of Eq.(42) and Eq.(43) can be written as the following matrix form.
kmi (τ ) − m(τ )k kmi (τ ) − m(τ )k

d kR(τ )k/η
≤A + (1 − β) (44)
dτ kW i (τ ) − W (τ )k kW i (τ ) − W (τ )k 0

−(1 − β)/η (1 − β)Lg /η
where A = . By direct computations, the eigenvalue of the matrix A is λ± =
1 0
√
−(1−β)± (1−β)2 +4(1−β)Lg +
2η . Therefore keAt k ≤ CA eλ t for any t ≥ 0 for some constant CA . Note that λ+ > 0.
Applying variation of constants formula to Eq.(44), one has
Z tη
kmi (tη) − m(tη)k kmi (0) − m(0)k

Atη A(tη−τ ) kR(τ )k/η
≤e + (1 − β) e dτ. (45)
kW i (tη) − W (tη)k kW i (0) − W (0)k 0 0
Note that mi (0) − m(0) = 0 and W i (0) − W (0). Therefore, for any t, it holds
Z tη
kmi (tη) − m(tη)k

A(tη−τ ) kR(τ )k/η
≤ (1 − β) e dτ. (46)
kW i (tη) − W (tη)k 0 0
+
Utilizing keAt k ≤ CA eλ t
for t ≥ 0 in the above inequality , we obtain
kmi (tη) − m(tη)k + kW i (tη) − W (tη)k

Z tη Z tη (47)
+ + +
eλ (tη−τ ) kR(τ )k/ηdτ = eλ tη (1 − β) e−λ τ

≤ kR(τ )k/η dτ,
0 0
which implies that kmi (tη) − m(tη)k is exponentially growing. Essentially, the term g i (W i (τ )) − g i (W (τ )) in Eq. (41)
causes such exponential growth. More precisely, since g i (W i ) − g i (W ) = ∇g i (W )(W i − W ) + O(|W i − W |2 ), it
contributes a linear term ∇g i (W )(W i − W ) to Eq.(41). Consequently, the matrix A has a positive eigenvalue, which
implies that keAt k is exponentially growing. Thus, the difference of momentum kmi (tη) − m(tη)k is exponentially
growing. However, in our FedDA method, since momentum is decoupled with local training,there is no such a term as
i i i −(1 − β)/η 0
g (W ) − g (W ). Therefore, in our FedDA framework, the matrix analogous to matrix A is , whose
1 0
eigenvalues are 0, −(1 − β). Therefore, the momentum deviation in our FedDA method is only algebraic growing.
Global momentum in FedDA deviates from the centralized one at algebraic rate. Let (W̄ i (τ ), m̄(τ ), W̄ (τ )) be the
solution to Eq.(22) with α = 1. Let (m(τ ), W (τ )) be the solution to Eq.(19) with (m(0), W (0)) = (m̄(0), W̄ (0)). It is
straightforward to compute
M
d X Ni
η (m̄(τ ) − m(τ )) = −(1 − β) (m̄(τ ) − m(τ )) + g i (W̄ i (τ )) − g i (W (τ ))
dτ N
i (48)
d
(W̄ (τ ) − W (τ )) = −(m̄(τ ) − m(τ )),
dτ
By calculations similar to the above ones, we first have
PM Ni

d km̄(τ ) − m(τ )k km̄(τ ) − m(τ )k k i g i (W̄ i (τ )) − g i (W (τ )) k/η
≤ Ā + (1 − β) N (49)
dτ kW̄ (τ ) − W (τ )k kW̄ (τ ) − W (τ )k 0

−(1 − β)/η 0
where Ā = . One can check that the eigenvalues of Ā are 0, −(1 − β)/η, which implies that there exists
1 0
a constant CĀ such that keĀt k ≤ CĀ for any t ≥ 0, i.e., keĀt k is uniformly bounded without growth. Clearly,
kg i (W̄ i (τ )) − g i (W (τ ))k ≤ Lg kW̄ i (τ ) − W (τ )k. (50)
Moreover, utilizing W̄ i (0) = W (0), we have

Z τ
i
W̄ (τ ) − W (τ ) = −g i (W̄ i (s)) + m(s)ds. (51)
0
It follows that
kW̄ i (τ ) − W (τ )k ≤ kg i kL∞ + supkm(s)k τ.

(52)
s≤τ
By variation of constants formula, we have

Z τ
1
km(τ )k ≤ e−(1−β)τ /η km(0)k + e−(1−β)(τ −s)/η (1 − β)kg(W (s))kds, (53)
η 0
which implies that |m(τ )| ≤ |m(0)| + kgkL∞ for any τ ≥ 0. Therefore, we have
kW̄ i (τ ) − W (τ )k = kg i kL∞ + km(0)k + kgkL∞ )τ. (54)
Combining Eq.(50) and Eq.(54), we have
kg i (W̄ i (τ )) − g i (W (τ ))k ≤ k∇g i kL∞ kg i kL∞ + km(0)k + kgkL∞ )τ. (55)

Applying the variation of constants formula to Eq.(49), we have
Z tη PM Ni

km̄(tη) − m(tη)k k i g i (W̄ i (τ )) − g i (W (τ )) k/η
≤ eĀ(tη−τ ) N dτ. (56)
kW̄ (tη) − W (tη)k 0 0
Utilizing keĀt k ≤ CĀ and Eq.(55), it holds
tη
C ∗η 2
Z
km̄(tη) − m(tη)| + kW̄ (tη) − W (tη)k ≤ C ∗ τ /ηdτ = t , (57)
0 2
where C ∗ = supk∇g i kL∞ kg i kL∞ + km(0)k + kgkL∞ ). Therefore, the momentum in our FedDA method deviates from
i
the centralized one in algebraic rate O(t2 ), which is much slower than that of local momentum.
A.4. Additional Experiments

In this section, we conduct more experiments to validate the accuracy and convergence of our proposed FedDAmethod and
evaluate the sensitivity of client and server learning rates in our momentum decoupling adaptive optimization method for the
FL task.
Convergence and loss over Stack Overflow. Figures 7-8 present the Accuracy and Loss curves of ten federated learning
algorithms on Stack Overflow. Similar trends are observed for the performance comparison: FedDA achieves the 75.4%
convergence improvement, which are obviously better than other methods in most experiments. This demonstrates that the
full batch gradient techniques that mimic centralized optimization in the end of the training process are able to ensure the
convergence and overcome the possible inconsistency caused by adaptive optimization methods.
0.7 0.7
FedLocal FedLocal FedLocal
0.4 SCAFFOLD SCAFFOLD SCAFFOLD
FedLin 0.6 FedLin 0.6 FedLin
FedAvg FedAvg FedAvg
MFL 0.5 MFL MFL
CLIMB CLIMB 0.5 CLIMB
0.3
STEM STEM STEM
MimeLite 0.4 MimeLite 0.4 MimeLite
FedOpt FedOpt FedOpt
0.2 FedDA 0.3 FedDA 0.3 FedDA
0.2 0.2
0.1
0.1 0.1
0.0 0.0 0.0

0 200 400 600 800 1000 1200 1400 0 200 400 600 800 1000 1200 1400 0 200 400 600 800 1000 1200 1400
(a) SGDM (b) Adam (c) AdaGrad
Figure 7: Convergence on Stack Overflow with Three Optimizers

80 FedLocal 70 FedLocal 70 FedLocal

SCAFFOLD SCAFFOLD SCAFFOLD
70 FedLin 60 FedLin 60 FedLin
FedAvg FedAvg FedAvg
60 MFL MFL MFL
CLIMB 50 CLIMB 50 CLIMB
50 STEM STEM STEM
MimeLite 40 MimeLite 40 MimeLite
40 FedOpt FedOpt FedOpt
FedDA FedDA FedDA
30 30
30
20 20
20
10 10 10
0 0 0
0 200 400 600 800 1000 1200 1400 0 200 400 600 800 1000 1200 1400 0 200 400 600 800 1000 1200 1400
(a) SGDM (b) Adam (c) AdaGrad
Figure 8: Loss on Stack Overflow with Three Optimizers
Final Accuracy over Stack Overflow. Table 5 presents the final Accuracy scores of ten federated learning algorithms
on Stack Overflow. We have observed similar trends: the accuracy achieved by our FedDA is the highest in most tests.
Especially, as shown the experiment with SGDM as the optimizer, compared to the best competitors among ten federated
learning algorithms, the final Accuracy scores achieved by FedDA averagely achieves 22.3% improvement. A rational
guess is that the global momentum in our FedDA method makes the best effort to mimic the role of momentum in centralized
training, which can accelerate the convergence of FL training.
Table 5: Final Accuracy on Stack Overflow
Optimizer SGDM Adam AdaGrad

FedLocal 0.152 0.576 0.229
SCAFFOLD 0.250 0.250 0.250
FedLin 0.224 0.224 0.224
FedAvg 0.252 0.252 0.252
MFL 0.134 0.101 0.215
CLIMB 0.302 0.302 0.302
STEM 0.196 0.196 0.196
MimeLite 0.271 0.211 0.078
FedOpt 0.225 0.642 0.691
FedDA 0.273 0.642 0.674
Impact of client and server learning rates. Tables 6-11 report how the test Accuracy changes with server and client
learning rates on three datasets by fixing the server learning rates and changing the client learning rates, or by utilizing the
reverse settings. We have observed that the Accuracy scores oscillate within the range of 0.002 and 0.868 when changing
the client learning rates, while the Accuracy values fluctuate between 0.010 and 0.861. This demonstrates that it is crucial
to choose the optimal learning rates for the training on the clients and server to achieve the competitive performance. Please
refer to Table 12 for the implementation details of the server and client learning rates used in our current experiments.
Table 6: Final Accuracy with SGDM Optimizer and Varying Client Learning Rate
Dataset Server Learning Rate Accuracy / Client Learning Rate

CIFAR-100 0.1 0.521 / 1 0.448 / 3.3 0.236 / 10 0.099 / 33
EMNIST 0.1 0.778 / 1 0.819 / 3.3 0.783 / 10 0.052 / 33
Stack Overflow 0.1 0.174 / 1 0.262 / 10 0.250 / 100 0.002 / 1,000
Table 7: Final Accuracy with Adam Optimizer and Varying Client Learning Rate

CIFAR-100 0.1 0.179 / 1 0.317 / 3.3 0.269 / 10 0.179 / 33
EMNIST 0.03 0.488 / 0.01 0.836 / 0.1 0.842 / 0.3 0.804 / 1
Stack Overflow 0.3 0.409 / 0.1 0.492 / 1 0.473 / 10 0.459 / 100
Table 8: Final Accuracy with AdaGrad Optimizer and Varying Client Learning Rate

CIFAR-100 0.1 0.462 / 0.03 0.488 / 0.1 0.410 / 0.3 0.259 / 1
EMNIST 0.1 0.055 / 0.00001 0.055 / 0.001 0.055 / 0.01 0.868 / 0.1
Stack Overflow 10 0.576 /1 0.651 / 10 0.629 / 30 0.632 / 100
Table 9: Final Accuracy with SGDM Optimizer and Varying Server Learning Rate
Dataset Client Learning Rate Accuracy / Server Learning Rate

CIFAR-100 0.03 0.374 / 1 0.522 / 3.3 0.356 / 10 0.273 / 33
EMNIST 0.1 0.859 / 0.3 0.861 / 1 0.778 / 3.3 0.783 / 10
Stack Overflow 100 0.191 / 0.001 0.256 / 0.003 0.204 / 0.01 0.292 / 0.1
Table 10: Final Accuracy with Adam Optimizer and Varying Server Learning Rate

CIFAR-100 0.03 0.510 / 0.33 0.183 / 3.3 0.010 / 10 0.010 / 33
EMNIST 0.03 0.546 / 0.1 0.803 / 1 0.051 / 3.3 0.051 / 10
Stack Overflow 100 0.425 / 0.1 0.522 / 0.3 0.641 / 1 0.633 / 10
Table 11: Final Accuracy with AdaGrad Optimizer and Varying Server Learning Rate

CIFAR-100 0.03 0.352 / 0.03 0.462 / 0.1 0.466 / 0.3 0.312 / 1
EMNIST 0.03 0.806 / 0.1 0.055 / 3.3 0.055 / 10 0.055 / 33
Stack Overflow 100 0.227 / 0.1 0.306 / 0.3 0.397 / 1 0.632 / 10
A.5. Experimental Details

Environment. Our experiments were conducted on a compute server running on Red Hat Enterprise Linux 7.2 with 2 CPUs
of Intel Xeon E5-2650 v4 (at 2.66 GHz) and 8 GPUs of NVIDIA GeForce GTX 2080 Ti (with 11GB of GDDR6 on a 352-bit
memory bus and memory bandwidth in the neighborhood of 620GB/s), 256GB of RAM, and 1TB of HDD. Overall, our
experiments took about 2 days in a shared resource setting. We expect that a consumer-grade single-GPU machine (e.g.,
with a 1080 Ti GPU) could complete our full set of experiments in around tens of hours, if its full resources were dedicated.
All the codes were implemented based on the Tensorflow Federated (TFF) package (Ingerman & Ostrowski, 2019). Clients
are sampled uniformly at random, without replacement in a given round, but with replacement across rounds. Our
implementation follows the same settings in the approaches of FedAvg (McMahan et al., 2017a) and FedOpt (Reddi et al.,
2021a). First, instead of doing K training steps per client, we do E epochs of training over each client’s dataset. Second, to
account for varying numbers of gradient steps per client, we weight the average of the client outputs by each client’s number
of training samples. Since the datasets used are all public datasets and the hyperparameter settings are explicitly described,
our experiments can be easily reproduced on top of a GPU server. We promise to release our open-source codes on GitHub
and maintain a project website with detailed documentation for long-term access by other researchers and end-users after
the paper is accepted.
Datasets. We following the same strategy in FedOpt (Reddi et al., 2021a) to create a federated version of CIFAR-100
by randomly partitioning the training data among 500 clients, with each client receiving 100 examples. We randomly
partition the data to reflect the coarse and fine label structure of CIFAR-100 by using the Pachinko Allocation Method (Li &
McCallum, 2006). In the derived client datasets, the label distributions better resemble practical heterogeneous client datasets.
We train a modified ResNet-18, where the batch normalization layers are replaced by group normalization layers (Wu & He,
2020). We use two groups in each group normalization layer. Group normalization can lead to significant gains in accuracy
over batch normalization in federated settings (Hsieh et al., 2020).
The federated version of EMNIST partitions the digits by their author (Caldas et al., 2018b). The dataset has natural
heterogeneity stemming from the writing style of each person. We use a convolutional network for character recognition.
The network has two convolutional layers with 3 × 3 kernels, max pooling, and dropout, followed by a 128 unit dense layer.
Stack Overflow is a language modeling dataset consisting of question and answers from the question and answer site (Ten-
sorFlow, 2019). The questions and answers also have associated metadata, including tags. The dataset contains 342,477
unique users which we use as clients. We choose the 10,000 most frequently used words, the 500 most frequent tags and
adopt a one-versus-rest classification strategy, where each question/answer is represented as a bag-of-words vector.
Implementation. For four regular federated learning models of FedAvg 1 , SCAFFOLD 2 , STEM 3 , and CLIMB 4 , we
used the open-source implementation and default parameter settings by the original authors or the Google Research for our
experiments. For three federated optimization approaches of MimeLite 5 , FedOpt 6 , and FedLocal 7 , we also utilized the
same model architecture as the official implementation provided by the Google Research and used the same datasets to
validate the performance of these federated optimization models in all experiments. For other regular federated learning
or federated optimization approaches, including MFL and FedLin, to our best knowledge, there are no publicly available
open-source implementations on the Internet. All hyperparameters are standard values from the reference works. The above
open-source codes from the GitHub are licensed under the MIT License, which only requires preservation of copyright and
license notices and includes the permissions of commercial use, modification, distribution, and private use.
For our proposed decoupled adaptive optimization algorithm, we performed hyperparameter selection by perform-
ing a parameter sweep on training rounds ∈ {1, 500, 2, 000, 2, 500, 3, 000, 3, 500, 4, 000}, momentum parameter
β1 ∈ {0.84, 0.86, 0.88, 0.9, 0.92, 0.94}, second moment parameter β2 ∈ {0.984, 0.986, 0.988, 0.99, 0.992, 0.994},
fuzz factor ∈ {0.00001, 0.0001, 0.001, 0.01, 0.1}, local iteration number ∈ {1, 2, 5, 10, 20}, and learning rate η ∈
{0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3.3, 10, 33, 100, 333, 1, 000}. The above search process is often done using validation
data in centralized settings. However, such data is often inaccessible in federated settings, especially cross-device settings.
Therefore, we tune by selecting the parameters that minimize the average training loss over the last 100 rounds of training.
We run 1,500 rounds of training on the EMNIST and Stack Overflow, and 4,000 rounds over the CIFAR-100.
Hyperparameter settings.
Unless otherwise explicitly stated, we used the following default parameter settings in the experiments, as shown in Table
12.
1
https://github.com/google-research/federated/tree/780767fdf68f2f11814d41bbbfe708274eb6d8b3/optimization
2
https://github.com/google-research/public-data-in-dpfl
3
https://papers.neurips.cc/paper/2021/hash/3016a447172f3045b65f5fc83e04b554-Abstract.html
4
https://openreview.net/forum?id=Xo0lbDt975
5
https://github.com/google-research/public-data-in-dpfl
6
https://github.com/google-research/federated/tree/master/optimization
7
https://github.com/google-research/federated/tree/master/local adaptivity
Table 12: Hyperparameter Settings
Parameter Value
Training rounds for EMNIST and Stack Overflow 1,500
Training rounds for CIFAR-100 4,000
Momentum parameter β1 0.9
Second moment parameter β2 0.99
Client learning rate with SGDM on CIFAR-100 0.03
Client learning rate with Adam on CIFAR-100 0.03
Client learning rate with AdaGrad on CIFAR-100 0.1
Server learning rate with SGDM on CIFAR-100 3.3
Server learning rate with Adam on CIFAR-100 0.3
Server learning rate with AdaGrad on CIFAR-100 0.1
Local iteration number with SGDM on CIFAR-100 5
Local iteration number with Adam on CIFAR-100 5
Local iteration number with AdaGrad on CIFAR-100 5
Fuzz factor with Adam on CIFAR-100 0.1
Fuzz factor with AdaGrad on CIFAR-100 0.1
Client learning rate with SGDM on EMNIST 0.1
Client learning rate with Adam on EMNIST 0.1
Client learning rate with AdaGrad on EMNIST 0.1
Server learning rate with SGDM on EMNIST 1
Server learning rate with Adam on EMNIST 0.1
Server learning rate with AdaGrad on EMNIST 0.1
Local iteration number with SGDM on EMNIST 10
Local iteration number with Adam on EMNIST 10
Local iteration number with AdaGrad on EMNIST 5
Fuzz factor with Adam on EMNIST 0.1
Fuzz factor with AdaGrad on EMNIST 0.1
Client learning rate with SGDM on Stack Overflow 100
Client learning rate with Adam on Stack Overflow 100
Client learning rate with AdaGrad on Stack Overflow 100
Server learning rate with SGDM on Stack Overflow 10
Server learning rate with Adam on Stack Overflow 1
Server learning rate with AdaGrad on Stack Overflow 10
Local iteration number with SGDM on Stack Overflow 5
Local iteration number with Adam on Stack Overflow 1
Local iteration number with AdaGrad on Stack Overflow 5
Fuzz factor with Adam on Stack Overflow 0.00001
Fuzz factor with AdaGrad on Stack Overflow 0.00001

paper14

Uploaded by

Copyright:

Available Formats

paper14

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

paper14

Uploaded by

Copyright:

Available Formats

Accelerated Federated Learning with Decoupled Adaptive Optimization

Abstract 2017a; Kairouz et al., 2021). In a FL model, end clients

W i (t+1) = W i (t)−∇Li (W i (t))∗η, t = 0, 1, · · · , T −1,

Table 1: Statistics of the Datasets

Dataset #Training Clients #Test Clients #Training Samples #Test Samples

0.5 5.5 16 5.5

Table 2: Final Accuracy on CIFAR-100 Table 3: Final Accuracy on EMNIST

Optimizer SGDM Adam AdaGrad Optimizer SGDM Adam AdaGrad

A.2. Momentum Decoupling Adaptive Optimization: FedDA+Adam and FedDA+AdaGrad

m(t + 1) = β1 ∗ m(t) + (1 − β1 ) ∗ g(W (t)),

with a global gradient G, i.e., P = β1 + (1 − β1 ) ∗ G.

gradient G and update the global AdaGrad optimizer with input G.

The global update rule is:

Algorithm 2 FedDA+ADAM& FedDA+AdaGrad

A.3. Advantage of Global Momentum in FedDA

2. (Bounded Gradient) kg i kL∞ < ∞ for any i = 1, 2, · · · , m.

kmi (τ ) − m(τ )k kmi (τ ) − m(τ )k

kmi (tη) − m(tη)k + kW i (tη) − W (tη)k

kg i (W̄ i (τ )) − g i (W (τ ))k ≤ Lg kW̄ i (τ ) − W (τ )k. (50)

Moreover, utilizing W̄ i (0) = W (0), we have

By variation of constants formula, we have

kW̄ i (τ ) − W (τ )k = kg i kL∞ + km(0)k + kgkL∞ )τ. (54)

Combining Eq.(50) and Eq.(54), we have

kg i (W̄ i (τ )) − g i (W (τ ))k ≤ k∇g i kL∞ kg i kL∞ + km(0)k + kgkL∞ )τ. (55)

Applying the variation of constants formula to Eq.(49), we have

Utilizing keĀt k ≤ CĀ and Eq.(55), it holds

A.4. Additional Experiments

0.0 0.0 0.0

(a) SGDM (b) Adam (c) AdaGrad

Figure 7: Convergence on Stack Overflow with Three Optimizers

80 FedLocal 70 FedLocal 70 FedLocal

(a) SGDM (b) Adam (c) AdaGrad

Figure 8: Loss on Stack Overflow with Three Optimizers

Table 5: Final Accuracy on Stack Overflow

Optimizer SGDM Adam AdaGrad

Dataset Server Learning Rate Accuracy / Client Learning Rate

Dataset Server Learning Rate Accuracy / Client Learning Rate

Dataset Server Learning Rate Accuracy / Client Learning Rate

Dataset Client Learning Rate Accuracy / Server Learning Rate

Dataset Client Learning Rate Accuracy / Server Learning Rate

Dataset Client Learning Rate Accuracy / Server Learning Rate

A.5. Experimental Details

Table 12: Hyperparameter Settings

You might also like