FedNLP: Benchmarking Federated Learning Methods for Natural Language Processing Tasks
Increasing concerns and regulations about data privacy and sparsity necessitate the study [...]
[Figure residue: task labels "Text Classification", "Question Answering", "Text Generation"]
arXiv:2104.08815v3 [cs.CL] 6 May 2022
in prior works to the NLP domain for generating synthetic FL benchmarks (Li et al., 2021a). We first introduce how we control the label distribution shift for TC and ST, then the quantity distribution shift, and finally how we model the distribution shift in terms of input features for non-classification NLP tasks (e.g., summarization).

Figure 2: The J-S divergence matrix between 100 clients on the 20News dataset when α ∈ {1, 5, 10, 100}. Each sub-figure is a 100x100 symmetric matrix. The color intensity of a cell (i, j) represents the distance between the label distributions of Client i and Client j. It is expected that when α is smaller, the partition over clients is more non-IID in terms of their label distributions.

Non-IID Label Distributions. Here we present how we synthesize the data partitions such that clients share the same (or very similar) number of examples but have different label distributions from each other. We assume that on every client, training examples are drawn independently with labels following a categorical distribution over L classes parameterized by a vector q (q_i ≥ 0 for i ∈ [1, L], and ‖q‖_1 = 1). To synthesize a population of non-identical clients, we draw q ∼ Dir_L(αp) from a Dirichlet distribution, where p characterizes a prior class distribution over L classes, and α > 0 is a concentration parameter controlling the identicalness among clients. For each client C_j, we draw a q_j as its label distribution and then sample examples without replacement from the global dataset according to q_j. With α → ∞, all clients have identical distributions to the prior (i.e., uniform distribution); with α → 0, on the other extreme, each client holds examples from only one class chosen at random. In Figure 2, we show heatmaps visualizing the distribution differences between each pair of clients. Figure 3 shows an example of the concrete label distributions for all clients with different α. We can see that when α is smaller, the overall label distribution shift becomes larger.

Controlling non-IID Quantity. It is also common that different clients have very different data quantities while sharing similar label distributions. We thus also provide a quantity-level Dirichlet allocation z ∼ Dir_N(β), where N is the number of clients. Then we can allocate examples in a global dataset to all clients according to the distribution z, i.e., |D_i| = z_i |D_G|. If we would like to model both quantity and label distribution shift, it is also easy to combine both factors. Note that one could assume a uniform distribution z ∼ U(N) (or β → ∞) if we expect all clients to share a similar number of examples. A concrete example is shown in Figure 8 (Appendix).

Controlling non-IID Features. Although straightforward and effective, the above label-based Dirichlet allocation method has a major limitation: it is only suitable for text classification tasks where the outputs can be modeled as category-based random variables. To create synthetic partitions for other non-classification NLP tasks and model distribution shifts, we thus propose a partition method based on feature clustering. Specifically, we use Sentence-BERT (Reimers and Gurevych, 2019) to encode the text of each example into a dense vector, then we apply K-Means clustering to get the cluster label of each example; finally, we use these cluster labels (as if they were classification labels) to follow the steps in modeling label distribution shift. There are two obvious benefits of this clustering-based Dirichlet partition method: 1) It enables us to easily synthesize the FL datasets for non-classification tasks (i.e., ST, QA, SS), as they do not have discrete labels as output space; 2) The BERT-based clustering results naturally imply different sub-topics of a dataset, and thus feature shift can be seen as a shift of latent labels, so we can reuse the label-based Dirichlet partition method.

Natural Factors. For datasets like MRQA, we consider a cross-silo setting where each client is associated with a particular sub-dataset (out of the six datasets of the same format), forming a natural distribution shift based on inherent factors such as data source and annotating style.
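The partition strategies in this section can be illustrated with a minimal, standard-library-only sketch. It is not the FedNLP implementation: the function names, the uniform prior p, the handling of exhausted classes, and the rounding of client sizes are all assumptions made for illustration. It shows a label-skew partition with q_j ∼ Dir_L(αp), a quantity allocation with z ∼ Dir_N(β), and the J-S divergence used to compare two clients' label distributions.

```python
import math
import random


def sample_dirichlet(concentration, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(c, 1.0) for c in concentration]
    total = sum(draws)
    if total == 0.0:
        # Tiny concentrations can underflow to zero; fall back to a one-hot vector.
        draws[rng.randrange(len(draws))] = 1.0
        total = 1.0
    return [d / total for d in draws]


def partition_by_label(labels, num_clients, alpha, seed=0):
    """Label skew: equal-sized clients, each sampling without replacement
    from the global dataset according to its own q_j ~ Dir_L(alpha * p)."""
    rng = random.Random(seed)
    classes = sorted(set(labels))
    prior = [1.0 / len(classes)] * len(classes)  # assumption: uniform prior p
    pools = {c: [i for i, y in enumerate(labels) if y == c] for c in classes}
    for pool in pools.values():
        rng.shuffle(pool)
    per_client = len(labels) // num_clients
    clients = []
    for _ in range(num_clients):
        q = sample_dirichlet([alpha * p for p in prior], rng)
        chosen = []
        for _ in range(per_client):
            # Assumption: classes that have run out of examples get weight zero.
            weights = [q[k] if pools[c] else 0.0 for k, c in enumerate(classes)]
            if sum(weights) == 0.0:
                weights = [1.0 if pools[c] else 0.0 for c in classes]
            c = rng.choices(classes, weights=weights)[0]
            chosen.append(pools[c].pop())
        clients.append(chosen)
    return clients


def quantity_sizes(total, num_clients, beta, seed=0):
    """Quantity skew: client sizes |D_i| = z_i * |D_G| with z ~ Dir_N(beta)."""
    rng = random.Random(seed)
    z = sample_dirichlet([beta] * num_clients, rng)
    sizes = [int(z_i * total) for z_i in z]
    # Assumption: the rounding remainder goes to the largest client.
    sizes[sizes.index(max(sizes))] += total - sum(sizes)
    return sizes


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

For non-classification tasks, the same `partition_by_label` sketch could be applied to cluster labels in place of class labels, which mirrors the clustering-based variant described above.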
| Task | Dataset | Partition | Clients | FedAvg | FedProx | FedOPT | # Rounds |
|---|---|---|---|---|---|---|---|
| Text Classification | 20news | α = 1 (label shift) | 100 | 0.5142 | 0.5143 | 0.5349 | 22 |
| Sequence Tagging | OntoNotes | α = 0.1 (label shift) | 30 | 0.7382 | 0.6731 | 0.7918 | 17 |
| Question Answering | MRQA | natural factor | 6 | 0.2707 | 0.2706 | 0.3280 | 13 |
| Seq2Seq Generation | Gigaword | α = 0.1 (feature shift) | 100 | 0.3192 | 0.3169 | 0.3037 | 13 |

Table 2: Comparison of different FL methods under the same setting on different NLP tasks. The number of workers per round is 10, except for the MRQA task, which uses 6.
Figure 4: The learning curves of the three FL methods on four different task formulations. The metrics used for these tasks are accuracy, span-F1, token-F1, and ROUGE, respectively; the x-axis is the number of rounds.
Figure 5: Testing FedOPT with DistilBERT for 20News under different data partition strategies.
[Plot residue: legend entries uniform, label (α=1), label (α=5), label (α=10), and quantity (parameter value 1); x-axis ticks from 0 to 20.]

| Frozen Layers | # Tunable Params | Cent. | FedOpt |
|---|---|---|---|
| None | 67.0M | 86.86 | 55.11 |
| E | 43.1M | 86.19 | 54.86 |
| E + L0 | 36.0M | 86.54 | 52.91 |
| E + L0→1 | 29.0M | 86.52 | 53.92 |
| E + L0→2 | 21.9M | 85.71 | 52.01 |
| E + L0→3 | 14.8M | 85.47 | 30.68 |
| E + L0→4 | 7.7M | 82.76 | 16.63 |
| E + L0→5 | 0.6M | 63.83 | 12.97 |

Table 3: Performance (Acc. %) on 20news (TC) when different parts of DistilBERT are frozen, for centralized training (Cent.) and FedOpt (at the 28th round). E stands for the embedding layer and L_i means the i-th layer. Significantly lower accuracies are underlined.
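The trade-off in Table 3 can be made concrete with a small arithmetic sketch. It assumes that the per-round payload is proportional to the number of tunable parameters (ignoring compression and numeric precision), which is a simplification and not a claim from the paper; the `summarize` helper and its field names are hypothetical. The numbers are copied from Table 3.

```python
# (frozen layers, tunable parameters in millions, centralized acc. %, FedOpt acc. %)
TABLE3 = [
    ("None", 67.0, 86.86, 55.11),
    ("E", 43.1, 86.19, 54.86),
    ("E + L0", 36.0, 86.54, 52.91),
    ("E + L0→1", 29.0, 86.52, 53.92),
    ("E + L0→2", 21.9, 85.71, 52.01),
    ("E + L0→3", 14.8, 85.47, 30.68),
    ("E + L0→4", 7.7, 82.76, 16.63),
    ("E + L0→5", 0.6, 63.83, 12.97),
]


def summarize(rows):
    """Relate each configuration to the fully tunable model (first row)."""
    _, full_params, full_cent, full_fed = rows[0]
    summary = []
    for name, params, cent, fed in rows:
        summary.append({
            "frozen": name,
            # Assumption: transmitted payload scales with tunable parameters.
            "payload_ratio": params / full_params,
            "cent_drop": full_cent - cent,  # accuracy points lost, centralized
            "fed_drop": full_fed - fed,     # accuracy points lost, FedOpt
        })
    return summary
```

Under this assumption, freezing E + L0→2 transmits about 33% of the full payload (21.9M of 67.0M parameters) at a cost of 3.10 accuracy points for FedOpt, while freezing one more layer raises the FedOpt drop to 24.43 points.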
This tells us that the use of AdamW as the client optimizer may not always be a good choice, especially for complex tasks such as Seq2Seq generation, as its adaptive method for scheduling learning rates might cause implicit conflicts. These observations suggest that federated optimization algorithms need to be tailored for various NLP tasks, and that exploring FL-friendly model architectures or loss functions can also be promising directions to address these challenges.

Q2: How do different non-IID partitions of the same data influence FL performance?

The FedNLP platform allows users to investigate the performance of an FL algorithm with a wide range of data partitioning strategies, as discussed in §3.2. Here we look at the training curves of FedOPT on different partitions, as shown in Figure 5. We reveal several findings:
• When α is smaller (i.e., the partition is more non-IID in terms of label distribution), the performance tends to degrade, based on the three curves (α ∈ {1, 5, 10}).
• The variance is also larger when the label distribution shift is larger. Both uniform and quantity-skew partitions have a smoother curve, while the variance is smaller for a larger α (e.g., 10).
• Quantity skew does not introduce a great challenge for federated learning when the label distribution is closer to the uniform one.
These findings suggest that it is important to design algorithms to mitigate data heterogeneity. One promising direction is personalized FL, which enables each client to learn its personalized model by adapting to its local data distribution and system resources (Dinh et al., 2020; Fallah et al., 2020; Li et al., 2021b).

Q3: How does freezing of Transformers influence the FL performance?

Communication cost is a major concern in the federated learning process. It is thus natural to consider freezing some Transformer layers of the client models to reduce the size of the trainable parameters that will be transmitted between servers and clients. To study the influence of freezing layers on the FL performance, we conduct a series of experiments that freeze the layers from the embedding layer (E) to the top layer (L_5) of DistilBERT with both centralized training and FedOPT on the text classification task.

We report our results in Table 3 and Figure 6. We find that in centralized training, the largest performance gain happens when we unfreeze the last layer, while in FedOPT we have to unfreeze the last three layers to enjoy a comparable performance with the full model. This suggests that reducing communication costs via freezing some layers of Transformer LMs is feasible, though one should be aware that the experience in centralized training may not generalize to the FL experiments.

Figure 6: Testing FedOPT with DistilBERT for 20News under different frozen layers.
[Plot residue: legend entries E, E+L0, E+L0→1, E+L0→2, E+L0→3, E+L0→4, E+L0→5; x-axis ticks from 0 to 25.]

Q4: Is the compact model DistilBERT adequate for FL+NLP?

We know that BERT has a better performance than DistilBERT owing to its larger model size. However, is it cost-effective to use BERT rather than DistilBERT? To study this, we compare the performance of both models with FedOPT on text classification, sharing the same setting as the above experiments. As shown in Figure 7, although BERT-base achieves better performance, the performance of DistilBERT is not significantly worse. Considering the communication cost (BERT-base is almost 2x larger), we argue that using DistilBERT is a more cost-effective choice for both experimental analysis and realistic applications.

Figure 7: FedOPT for 20News with different LMs.
[Plot residue: legend entries bert-base and distilbert-base; x-axis ticks from 0 to 25.]

5 Related Work

FL benchmarks and platforms. In the last few years, a proliferation of frameworks and benchmark datasets has been developed to enable researchers to better explore and study algorithms and modeling for federated learning, both from academia: LEAF (Caldas et al., 2018), FedML (He et al., 2020c), Flower (Beutel et al., 2020), and from industry: PySyft (Ryffel et al., 2018), TensorFlow-Federated (TFF) (Ingerman and Ostrowski, 2019), FATE (Yang et al., 2019), Clara (NVIDIA, 2019), PaddleFL (Ma et al., 2019), Open FL (Intel®, 2021). However, most platforms only focus on designing a unified framework for federated learning methods and do not provide a dedicated environment for studying NLP problems with FL methods. LEAF (Caldas et al., 2018) contains a few text datasets; however, it is limited to classification and next-word prediction datasets and does not consider pre-trained language models. We want to provide a dedicated platform for studying FL methods in realistic NLP applications with state-of-the-art language models.

Federated learning in NLP applications. A few prior works have begun to apply FL methods in privacy-oriented NLP applications. For example, federated learning has been applied to many keyboard-related applications (Hard et al., 2018; Stremmel and Singh, 2020; Leroy et al., 2019; Ramaswamy et al., 2019; Yang et al., 2018a), sentence-level text intent classification using Text-CNN (Zhu et al., 2020), and pretraining and fine-tuning of BERT using medical data from multiple silos without fetching all data to the same place (Liu and Miller, 2020). FL methods have also been proposed to train high-quality language models that can outperform the models trained without federated learning (Ji et al., 2019; Chen et al., 2019). Besides these applications, some work has been done in medical relation extraction (Ge
et al., 2020) and medical named entity recognition (Sui et al., 2020). These methods use federated learning to preserve the privacy of sensitive medical data and to learn from data on different platforms, removing the need to exchange data between platforms.

Our work aims to provide a unified platform for studying various NLP applications in a shared environment so that researchers can better design new FL methods either for a specific NLP task or as a general-purpose model. The aforementioned prior works would thus be particular instances of the settings supported by the FedNLP platform.

6 Conclusion and Future Directions

Our key contribution is providing a thorough and insightful empirical analysis of existing federated learning algorithms in the context of NLP models. Notably, we compare typical FL methods for four NLP task formulations under multiple non-IID data partitions. Our findings reveal both the promise and the challenges of FL for NLP. In addition, we also provide a suite of resources to support future research in FL for NLP (e.g., a unifying framework for connecting Transformer models with popular FL methods and different non-IID partition strategies). Thus, we believe our well-maintained open-source codebase will support future work in this area.

Promising future directions in FL for NLP include: 1) minimizing the performance gap, 2) improving the system efficiency and scalability, 3) trustworthy and privacy-preserving NLP, 4) personalized FL methods for NLP, etc. (Please see Appendix E for more details.)

[...] trails, and recourse so that their predictions can be explained to and critiqued by affected parties.

Limitations. One limitation of our work is that we have not analyzed the privacy leakage of FL methods. We argue that novel privacy-centric measures are orthogonal to the development of FL methods, which is beyond the scope of our work. How to fairly analyze privacy leakage is still an open problem for both FL and NLP, and it is only possible to study this when we have an existing platform like FedNLP.

This work is supported in part by a research grant and an Amazon ML Fellowship from USC-Amazon Center on Secure and Trustworthy AI (https://trustedai.usc.edu). Xiang Ren is supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via Contract No. 2019-19051600007, the DARPA MCS program under Contract No. N660011924033, the Defense Advanced Research Projects Agency with award W911NF-19-20271, NSF IIS 2048211, NSF SMA 1829268, and gift awards from Google, Amazon, JP Morgan and Sony. Mahdi Soltanolkotabi is supported by the Packard Fellowship in Science and Engineering, a Sloan Research Fellowship in Mathematics, an NSF-CAREER under award #1846369, DARPA Learning with Less Labels (LwLL) and FastNICS programs, and NSF-CIF awards #1813877 and #2008443.
to improve model accuracy or fairness is a very promising direction. In addition, it is also an interesting problem to adapt the heterogeneous model architecture for each client in the FL network. We show that it is feasible to only fine-tune a small amount of the parameters of LMs, so it is promising to adapt recent prefix-tuning methods (Li and Liang, 2021) for personalizing the parameters of NLP models within the FedNLP framework.

[Code listing fragment, truncated:]

    # create model by specifying the task
    client_model, ... = create_model(model_args, formulation="classification")
    # define a customized NLP Trainer
    client_trainer = TCTrainer(device, client_model, ...)
    # launch the federated training (e.g., FedAvg)
    FedAvg_distributed(..., device, data_dict, ...,
F The System Design of FedNLP

The FedNLP platform consists of three layers: the application layer, the algorithm layer, and the infrastructure layer. At the application layer, FedNLP provides three modules: data management, model definition, and a single-process trainer for all task formats; at the algorithm layer, FedNLP supports various FL algorithms; at the infrastructure layer, FedNLP aims at integrating single-process trainers with a distributed learning system for FL. Specifically, we make each layer and module perform its own duties and have a high degree of modularization.

F.1 Overall Workflow

The module calling logic flow of the whole framework is shown on the left of Figure 9. When we start the federated training, we first complete the launcher script, device allocation, data loading, and model creation, and finally call the API of the federated learning algorithm. This process is expressed in Python-style code (see Alg. 2).

F.2 The Application Layer

Data Management. In data management, DataManager controls the whole workflow from loading data to returning trainable features. To be specific, DataManager is set up for reading h5py data files and driving a preprocessor to convert raw data to features. There are four types of DataManager according to the task definition. Users can customize their DataManager by inheriting one of the DataManager classes, specifying data operation functions, and embedding a particular preprocessor. Note that the raw data's H5Py file and the non-IID partition file are preprocessed offline, while DataManager only loads them at runtime.

Model Definition. We support two types of models: Transformer and LSTM. For Transformer models, to integrate with the existing NLP ecosystem, our framework is compatible with the HuggingFace Transformers library (Wolf et al., 2020), so that various types of Transformers can be directly reused without the need for re-implementation. Specifically, our code is compatible with the three main classes of Tokenizer, Model, and Config in HuggingFace. Users can also customize them based on HuggingFace's code. Although LSTM has gradually deviated from the mainstream, we still support LSTM for the framework's completeness, which may meet some particular use cases in a federated setting.

NLP Trainer (single process perspective). As for the task-specific NLP Trainer, the most prominent feature is that it does not require users to have any background in distributed computing. Users of FedNLP only need to write single-process code. A user should inherit the Trainer class in the application layer
Figure 9: The overall workflow and system design of the proposed FedNLP platform.
to implement the four methods as shown in the figure: 1. the get_model_params() interface allows the algorithm layer to obtain model parameters and transmit them to the server; 2. the set_model_params() interface obtains the updated model from the server's aggregation and then updates the model parameters of the local model; 3. the programming of the train() and test() functions only needs to consider the data of a single user, meaning that the trainer is completely consistent with centralized training.

F.3 The Algorithm Layer

In the design of the algorithm layer, we follow the principle of one-line API. The parameters of the API include model, data, and single-process trainer (as shown in Algorithm 2). The algorithms we support include:

Centralized Training. We concatenate all client datasets and use the global data D_G to train a global model, i.e., the conventional protocol for learning an NLP model on a dataset.

FedAvg (McMahan et al., 2017a) is the de facto method for federated learning, assuming both client and server use the SGD optimizer for updating model weights.

FedProx (Li et al., 2020b) can tackle statistical heterogeneity by restricting the local model updates to be closer to the initial (global) model with L2 regularization for better stability in training.

FedOPT (Reddi et al., 2021) is a generalized version of FedAvg. There are two gradient-based optimizers in the algorithm: ClientOpt and ServerOpt (please refer to the pseudo code in the original paper (Reddi et al., 2021)). While ClientOpt is used to update the local models, ServerOpt treats the negative of aggregated local changes −Δ^(t) as a pseudo-gradient and applies it on the global model. In our FedNLP framework, by default, we set the ClientOpt to be AdamW (Loshchilov and Hutter, 2019) and the ServerOpt to be SGD with momentum (0.9) and fix the server learning rate as 1.0.

Each algorithm includes two core objects, ServerManager and ClientManager, which integrate the communication module ComManager from the infrastructure layer and the Trainer of the training engine to complete the distributed algorithm protocol and edge training. Note that users can customize the Trainer by passing a customized Trainer through the algorithm API.

[...] ??). The main idea of LightSecAgg is that each user protects its local model using a locally generated random mask. This mask is then encoded and shared with other users, in such a way that the aggregate mask of any sufficiently large set of surviving users can be directly reconstructed at the server. Our main effort in FedNLP is integrating these algorithms, optimizing their system performance, and designing user-friendly APIs to make them compatible with NLP models and FL [...]
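The FedOPT server update described in this section (ServerOpt applied to the pseudo-gradient −Δ^(t), by default SGD with momentum 0.9 and learning rate 1.0) can be sketched as follows. This is an illustrative sketch and not the FedNLP code: model weights are represented as flat lists of floats, the client changes are averaged uniformly unless weights are given, and the momentum rule follows the common v ← μv + g, x ← x − lr·v convention. All of these are assumptions, and the names `ServerSGD`, `aggregate_deltas`, and `fedopt_round` are hypothetical.

```python
def aggregate_deltas(global_w, client_ws, weights=None):
    """Average the local changes (client model minus global model)."""
    n = len(client_ws)
    weights = weights or [1.0 / n] * n  # assumption: uniform averaging
    return [
        sum(w * (cw[k] - global_w[k]) for w, cw in zip(weights, client_ws))
        for k in range(len(global_w))
    ]


class ServerSGD:
    """SGD with momentum, used as the server optimizer (ServerOpt)."""

    def __init__(self, lr=1.0, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.velocity = None

    def step(self, global_w, pseudo_grad):
        if self.velocity is None:
            # First step: the velocity starts as the pseudo-gradient itself.
            self.velocity = list(pseudo_grad)
        else:
            self.velocity = [
                self.momentum * v + g
                for v, g in zip(self.velocity, pseudo_grad)
            ]
        return [w - self.lr * v for w, v in zip(global_w, self.velocity)]


def fedopt_round(global_w, client_ws, server_opt, weights=None):
    """One server round: treat the negated average change as a gradient."""
    delta = aggregate_deltas(global_w, client_ws, weights)
    pseudo_grad = [-d for d in delta]
    return server_opt.step(global_w, pseudo_grad)
```

With `momentum=0.0` and `lr=1.0`, each round reduces to replacing the global model with the average of the client models, which is the sense in which FedOPT generalizes FedAvg.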
F.4 The Infrastructure Layer
The infrastructure layer includes three modules:
1) Users can write distributed scripts to manage GPU resource allocation. In particular, FedNLP provides the GPU assignment API (map_process_to_gpu() in Algorithm 2) to assign specific GPUs to different FL clients.
2) The algorithm layer can use a unified and abstract ComManager to complete a complex algorithmic communication protocol. Currently, we support MPI (Message Passing Interface), RPC (Remote Procedure Call), and MQTT (Message Queuing Telemetry Transport) communication backends. MPI meets the distributed training needs in a single cluster; RPC meets the communication needs across data centers (e.g., cross-silo federated learning); MQTT can meet the communication needs of smartphones or IoT devices.
3) The third part is the training engine, which reuses existing deep learning training engines by exposing them through the Trainer class. Our current version of this module is built on PyTorch, but it can easily support frameworks such as TensorFlow. In the future, we may consider supporting lightweight edge training engines optimized by compiler technology at this level.
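The single-process Trainer abstraction mentioned above and in F.2 can be sketched as a small interface. The four method names come from the text; everything else (the class names `SingleProcessTrainer` and `ToyLinearTrainer`, the method signatures, and the toy one-parameter model) is hypothetical and chosen only so the example runs without any deep learning framework.

```python
from abc import ABC, abstractmethod


class SingleProcessTrainer(ABC):
    """Hypothetical interface with the four methods named in the text."""

    @abstractmethod
    def get_model_params(self):
        """Return parameters for the algorithm layer to send to the server."""

    @abstractmethod
    def set_model_params(self, params):
        """Load the aggregated parameters received from the server."""

    @abstractmethod
    def train(self, train_data):
        """Train on the data of a single client only."""

    @abstractmethod
    def test(self, test_data):
        """Evaluate on the data of a single client only."""


class ToyLinearTrainer(SingleProcessTrainer):
    """Fits y = w * x with plain SGD; stands in for a real NLP trainer."""

    def __init__(self, lr=0.1, epochs=50):
        self.w = 0.0
        self.lr = lr
        self.epochs = epochs

    def get_model_params(self):
        return {"w": self.w}

    def set_model_params(self, params):
        self.w = float(params["w"])

    def train(self, train_data):
        for _ in range(self.epochs):
            for x, y in train_data:
                grad = 2.0 * (self.w * x - y) * x  # d/dw of squared error
                self.w -= self.lr * grad

    def test(self, test_data):
        # Mean squared error on the local data.
        return sum((self.w * x - y) ** 2 for x, y in test_data) / len(test_data)
```

Because `train()` and `test()` only see one client's data, the same class could be used unchanged for centralized training, which is the property the text emphasizes.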