FedNLP: Benchmarking Federated Learning Methods for Natural Language Processing Tasks
Abstract

Increasing concerns and regulations about data privacy and sparsity necessitate the study

Figure 1 (overview; recovered panel labels): Text Classification, Question Answering, Text Generation; 100 clients.
in prior works to the NLP domain for generating synthetic FL benchmarks (Li et al., 2021a). We first introduce how we control the label distribution shift for TC and ST, then the quantity distribution shift, and finally how we model the distribution shift in terms of input features for non-classification NLP tasks (e.g., summarization).

Figure 2: The J-S divergence matrix between 100 clients on the 20News dataset when α ∈ {1, 5, 10, 100}. Each sub-figure is a 100x100 symmetric matrix. The intensity of a cell (i, j)'s color represents the distance between the label distributions of Client i and Client j. It is expected that when α is smaller, the partition over clients is more non-IID in terms of their label distributions.
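The matrix shown in Figure 2 can be recomputed from the per-client label histograms. Below is a minimal sketch (not the FedNLP code) using scipy; client_label_counts is an assumed N x L array of label counts per client.

import numpy as np
from scipy.spatial.distance import jensenshannon

def js_divergence_matrix(client_label_counts):
    # Pairwise Jensen-Shannon divergence between clients' label distributions.
    probs = client_label_counts / client_label_counts.sum(axis=1, keepdims=True)
    n = probs.shape[0]
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # scipy returns the J-S *distance*; square it to obtain the divergence
            d = jensenshannon(probs[i], probs[j], base=2) ** 2
            dist[i, j] = dist[j, i] = d
    return dist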
Non-IID Label Distributions. Here we present how we synthesize the data partitions such that clients share the same (or a very similar) number of examples but have different label distributions from each other. We assume that on every client, training examples are drawn independently with labels following a categorical distribution over L classes parameterized by a vector q (q_i ≥ 0, i ∈ [1, L], and ||q||_1 = 1). To synthesize a population of non-identical clients, we draw q ∼ Dir_L(αp) from a Dirichlet distribution, where p characterizes a prior class distribution over the L classes and α > 0 is a concentration parameter controlling the identicalness among clients. For each client C_j, we draw a q_j as its label distribution and then sample examples without replacement from the global dataset according to q_j. With α → ∞, all clients have distributions identical to the prior (i.e., the uniform distribution); with α → 0, at the other extreme, each client holds examples from only one class chosen at random. In Fig. 2, we show heatmaps visualizing the distribution differences between clients. Figure 3 shows an example of the concrete label distributions of all clients for different α. We can see that when α is smaller, the overall label distribution shift becomes larger.
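The procedure above can be illustrated with the following simplified sketch (it follows the description in the text rather than the exact FedNLP partition script; function and variable names are illustrative):

import numpy as np

def dirichlet_label_partition(labels, num_clients, alpha, seed=0):
    # Assign example indices to clients so that each client's label
    # distribution follows q_j ~ Dir(alpha * p), with p the global class prior.
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    prior = counts / counts.sum()                # p: global class prior
    per_client = len(labels) // num_clients      # roughly equal-sized clients
    pools = {c: list(rng.permutation(np.where(labels == c)[0])) for c in classes}

    partition = []
    for _ in range(num_clients):
        q = rng.dirichlet(alpha * prior)         # client-specific label distribution
        draws = rng.choice(classes, size=per_client, p=q)
        idx = []
        for c in draws:
            if pools[c]:                         # sample without replacement;
                idx.append(pools[c].pop())       # skip if a class pool runs out
        partition.append(idx)
    return partition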
Controlling non-IID Quantity. It is also common that different clients have very different data quantities while sharing similar label distributions. We thus also provide a quantity-level Dirichlet allocation z ∼ Dir_N(β), where N is the number of clients. We then allocate the examples of a global dataset to all clients according to the distribution z, i.e., |D_i| = z_i |D_G|. If we would like to model both quantity and label distribution shift, it is also easy to combine both factors. Note that one could assume a uniform distribution z ∼ U(N) (or β → ∞) if we expect all clients to share a similar number of examples. A concrete example is shown in Figure 8 (Appendix).

Controlling non-IID Features. Although straightforward and effective, the above label-based Dirichlet allocation method has a major limitation: it is only suitable for text classification tasks, where the outputs can be modeled as category-based random variables. To create synthetic partitions for other, non-classification NLP tasks and model their distribution shifts, we thus propose a partition method based on feature clustering. Specifically, we use Sentence-BERT (Reimers and Gurevych, 2019) to encode each example's text into a dense vector, then apply K-Means clustering to obtain a cluster label for each example; finally, we treat these cluster labels as if they were classification labels and follow the same steps used for modeling label distribution shift. There are two obvious benefits of this clustering-based Dirichlet partition method: 1) it enables us to easily synthesize FL datasets for non-classification tasks (i.e., ST, QA, SS), which do not have discrete labels as output space; 2) the BERT-based clustering results naturally capture different sub-topics of a dataset, so feature shift can be seen as a shift of latent labels, allowing us to reuse the label-based Dirichlet partition method. A sketch of this pipeline is shown below.

Natural Factors. For datasets like MRQA, we consider a cross-silo setting where each client is associated with a particular sub-dataset (out of the six datasets of the same format), forming a natural distribution shift based on inherent factors such as data source and annotation style.
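A minimal sketch of the clustering-based partition follows. It reuses the dirichlet_label_partition sketch above; the Sentence-BERT checkpoint name and cluster count are assumptions, and sentence-transformers/scikit-learn are assumed to be installed.

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def cluster_then_partition(texts, num_clusters, num_clients, alpha):
    # 1) Encode each example's text into a dense vector with a Sentence-BERT model.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any Sentence-BERT checkpoint
    embeddings = encoder.encode(texts)
    # 2) K-Means assigns a latent "label" (cluster id) to every example.
    cluster_ids = KMeans(n_clusters=num_clusters, n_init=10).fit_predict(embeddings)
    # 3) Reuse the label-based Dirichlet partition on the cluster ids.
    return dirichlet_label_partition(cluster_ids, num_clients, alpha)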
Task | Dataset | Partition | Clients | FedAvg | FedProx | FedOPT | # Rounds
Text Classification | 20news | α = 1 (label shift) | 100 | 0.5142 | 0.5143 | 0.5349 | 22
Sequence Tagging | OntoNotes | α = 0.1 (label shift) | 30 | 0.7382 | 0.6731 | 0.7918 | 17
Question Answering | MRQA | natural factor | 6 | 0.2707 | 0.2706 | 0.3280 | 13
Seq2Seq Generation | Gigaword | α = 0.1 (feature shift) | 100 | 0.3192 | 0.3169 | 0.3037 | 13

Table 2: Comparison between different FL methods under the same setting on different NLP tasks. The number of workers per round is 10, except for the MRQA task, which uses 6.
Figure 4: The learning curves of the three FL methods on the four task formulations. The metrics used for these tasks are accuracy, span-F1, token-F1, and ROUGE, respectively; the x-axis is the number of rounds.
Figure 5: Testing FedOPT with DistilBERT for 20News under different data partition strategies (uniform, label shift with α = 1, 5, 10, and quantity shift).

Frozen Layers | # Tunable Params. | Cent. | FedOPT
None | 67.0M | 86.86 | 55.11
E | 43.1M | 86.19 | 54.86
E + L0 | 36.0M | 86.54 | 52.91
E + L0→1 | 29.0M | 86.52 | 53.92
E + L0→2 | 21.9M | 85.71 | 52.01
E + L0→3 | 14.8M | 85.47 | 30.68
E + L0→4 | 7.7M | 82.76 | 16.63
E + L0→5 | 0.6M | 63.83 | 12.97

Table 3: Performance (Acc.%) on 20news (TC) when different parts of DistilBERT are frozen, for centralized training (Cent.) and FedOPT (at the 28th round). E stands for the embedding layer and Li means the i-th Transformer layer. Significantly lower accuracies are underlined.
This tells us that the use of AdamW as the client optimizer may not always be a good choice, especially for complex tasks such as Seq2Seq generation, as its adaptive learning-rate scheduling might cause implicit conflicts. These observations suggest that federated optimization algorithms need to be tailored to various NLP tasks, and that exploring FL-friendly model architectures or loss functions is also a promising direction for addressing these challenges.

Q2: How do different non-IID partitions of the same data influence FL performance?

The FedNLP platform supports users in investigating the performance of an FL algorithm under a wide range of data partitioning strategies, as discussed in §3.2. Here we look at the training curves of FedOPT on different partitions, as shown in Figure 5. We reveal several findings:
• When α is smaller (i.e., the partition is more non-IID in terms of label distribution), the performance tends to degrade, based on the three curves (α ∈ {1, 5, 10}).
• The variance is also larger when the label distribution shift is larger. Both the uniform and quantity-skew partitions have smoother curves, and the variance is smaller for a larger α (e.g., 10).
• Quantity skew does not introduce a great challenge for federated learning when the label distribution is close to uniform.
These findings suggest that it is important to design algorithms that mitigate data heterogeneity. One promising direction is personalized FL, which enables each client to learn its own personalized model adapted to its local data distribution and system resources (Dinh et al., 2020; Fallah et al., 2020; Li et al., 2021b).

Q3: How does freezing of Transformer layers influence FL performance?

Communication cost is a major concern in the federated learning process. It is thus natural to consider freezing some Transformer layers of the client models to reduce the size of the trainable parameters that are transmitted between the server and clients. To study the influence of freezing layers on FL performance, we conduct a series of experiments that freeze the layers from the embedding layer (E) up to the top layer (L5) of DistilBERT, with both centralized training and FedOPT, on the text classification task.
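For reference, this kind of freezing can be done by disabling gradients before training, as in the sketch below with HuggingFace Transformers (the exact configuration used in our experiments may differ; the function name is illustrative).

from transformers import DistilBertForSequenceClassification

def freeze_distilbert(model, num_frozen_layers):
    # Freeze the embedding layer (E) and Transformer blocks L0..L_{num_frozen_layers-1}.
    for p in model.distilbert.embeddings.parameters():
        p.requires_grad = False
    for layer in model.distilbert.transformer.layer[:num_frozen_layers]:
        for p in layer.parameters():
            p.requires_grad = False
    return model

# e.g., the "E + L0->3" row of Table 3: freeze E and the first four blocks
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=20)
model = freeze_distilbert(model, num_frozen_layers=4)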
Figure 6: Testing FedOPT with DistilBERT for 20News under different frozen layers (curves: None, E, E+L0, E+L0→1, ..., E+L0→5).

Figure 7: FedOPT for 20News with different LMs (bert-base vs. distilbert-base).
We report our results in Table 3 and Figure 6. We find that in centralized training, the largest performance gain happens when we unfreeze the last layer, while in FedOPT we have to unfreeze the last three layers to obtain performance comparable with the full model. This suggests that reducing communication costs by freezing some layers of Transformer LMs is feasible, though one should be aware that experience from centralized training may not generalize to the FL experiments.

Q4: Are compact models like DistilBERT adequate for FL+NLP?

We know that BERT has better performance than DistilBERT owing to its larger model size. However, is it cost-effective to use BERT rather than DistilBERT? To study this, we compare the performance of both models with FedOPT on text classification, sharing the same setting as the above experiments. As shown in Figure 7, although BERT-base achieves better performance, the performance of DistilBERT is not significantly worse. Considering the communication cost (BERT-base is almost 2x larger), we argue that DistilBERT is a more cost-effective choice for both experimental analysis and realistic applications.

5 Related Work

FL benchmarks and platforms. In the last few years, a proliferation of frameworks and benchmark datasets have been developed to enable researchers to better explore and study algorithms and modeling for federated learning, both from academia: LEAF (Caldas et al., 2018), FedML (He et al., 2020c), Flower (Beutel et al., 2020); and from industry: PySyft (Ryffel et al., 2018), TensorFlow-Federated (TFF) (Ingerman and Ostrowski, 2019), FATE (Yang et al., 2019), Clara (NVIDIA, 2019), PaddleFL (Ma et al., 2019), and OpenFL (Intel®, 2021). However, most platforms only focus on designing a unified framework for federated learning methods and do not provide a dedicated environment for studying NLP problems with FL methods. LEAF (Caldas et al., 2018) contains a few text datasets; however, it is limited to classification and next-word prediction datasets and does not consider pre-trained language models. We want to provide a dedicated platform for studying FL methods in realistic NLP applications with state-of-the-art language models.

Federated learning in NLP applications. A few prior works have begun to apply FL methods in privacy-oriented NLP applications. For example, federated learning has been applied to many keyboard-related applications (Hard et al., 2018; Stremmel and Singh, 2020; Leroy et al., 2019; Ramaswamy et al., 2019; Yang et al., 2018a), sentence-level text intent classification using Text-CNN (Zhu et al., 2020), and pretraining and fine-tuning of BERT using medical data from multiple silos without fetching all data to the same place (Liu and Miller, 2020). FL methods have also been proposed to train high-quality language models that can outperform models trained without federated learning (Ji et al., 2019; Chen et al., 2019). Besides these applications, some work has been done on medical relation extraction (Ge et al., 2020) and medical named entity recognition (Sui et al., 2020).
These methods use federated learning to preserve the privacy of sensitive medical data and to learn from data held on different platforms, without the need to exchange data between platforms.

Our work aims to provide a unified platform for studying various NLP applications in a shared environment so that researchers can better design new FL methods, either for a specific NLP task or as a general-purpose model. The aforementioned prior works would thus be particular instances of the settings supported by the FedNLP platform.

6 Conclusion and Future Directions

Our key contribution is providing a thorough and insightful empirical analysis of existing federated learning algorithms in the context of NLP models. Notably, we compare typical FL methods for four NLP task formulations under multiple non-IID data partitions. Our findings reveal both the promise and the challenges of FL for NLP. In addition, we provide a suite of resources to support future research in FL for NLP (e.g., a unifying framework for connecting Transformer models with popular FL methods and different non-IID partition strategies). We thus believe our well-maintained open-source codebase will support future work in this area.

Promising future directions in FL for NLP include: 1) minimizing the performance gap, 2) improving system efficiency and scalability, 3) trustworthy and privacy-preserving NLP, and 4) personalized FL methods for NLP. (Please see Appendix E for more details.)

trails, and recourse so that their predictions can be explained to and critiqued by affected parties.

Limitations. One limitation of our work is that we have not analyzed the privacy leakage of FL methods. We argue that novel privacy-centric measures are orthogonal to the development of FL methods, which is beyond the scope of our work. How to fairly analyze privacy leakage is still an open problem for both FL and NLP, and it is only possible to study this once a platform like FedNLP exists.

Acknowledgements

This work is supported in part by a research grant and an Amazon ML Fellowship from the USC-Amazon Center on Secure and Trustworthy AI (https://trustedai.usc.edu). Xiang Ren is supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via Contract No. 2019-19051600007, the DARPA MCS program under Contract No. N660011924033, the Defense Advanced Research Projects Agency with award W911NF-19-20271, NSF IIS 2048211, NSF SMA 1829268, and gift awards from Google, Amazon, JP Morgan and Sony. Mahdi Soltanolkotabi is supported by the Packard Fellowship in Science and Engineering, a Sloan Research Fellowship in Mathematics, an NSF-CAREER under award #1846369, the DARPA Learning with Less Labels (LwLL) and FastNICS programs, and NSF-CIF awards #1813877 and #2008443.
Matthew Dunn, Levent Sagun, Mike Higgins, V. U. Güney, Volkan Cirik, and Kyunghyun Cho. 2017. Searchqa: A new q&a dataset augmented with context from a search engine. ArXiv.

A. Elkordy and A. Avestimehr. 2020. Secure aggregation with heterogeneous quantization in federated learning. ArXiv.

P. Kairouz et al. 2019. Advances and open problems in federated learning. ArXiv.

Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. 2020. Personalized federated learning: A meta-learning approach. ArXiv preprint.

Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, and Danqi Chen. 2019. MRQA 2019 shared task: Evaluating generalization in reading comprehension. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 1–13, Hong Kong, China. Association for Computational Linguistics.

Suyu Ge, Fangzhao Wu, Chuhan Wu, Tao Qi, Yongfeng Huang, and X. Xie. 2020. Fedner: Privacy-preserving medical named entity recognition with federated learning. ArXiv.

Chaoyang He, Haishan Ye, Li Shen, and Tong Zhang. 2020d. Milenas: Efficient neural architecture search via mixed-level reformulation. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 11990–11999. IEEE.

Alex Ingerman and Krzys Ostrowski. 2019. TensorFlow Federated.

Intel®. 2021. Intel® open federated learning.

Shaoxiong Ji, Shirui Pan, Guodong Long, Xue Li, Jing Jiang, and Zi Huang. 2019. Learning private neural language modeling with attentive aggregation. 2019 International Joint Conference on Neural Networks (IJCNN).

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada. Association for Computational Linguistics.
Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Keith Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. 2019. Advances and open problems in federated learning. ArXiv preprint.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466.

Ken Lang. 1995. Newsweeder: Learning to filter netnews. In Proc. of ICML.

David Leroy, Alice Coucke, Thibaut Lavril, Thibault Gisselbrecht, and Joseph Dureau. 2019. Federated learning for keyword spotting. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2019, Brighton, United Kingdom, May 12-17, 2019, pages 6341–6345. IEEE.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Q. Li, Yiqun Diao, Quan Chen, and Bingsheng He. 2021a. Federated learning on non-iid data silos: An experimental study. ArXiv.

Tian Li, Shengyuan Hu, Ahmad Beirami, and Virginia Smith. 2021b. Ditto: Fair and robust federated learning through personalization. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 6357–6368. PMLR.

Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and V. Smith. 2020a. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine.

Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. 2020b. Federated optimization in heterogeneous networks. In Proceedings of Machine Learning and Systems 2020, MLSys 2020, Austin, TX, USA, March 2-4, 2020. mlsys.org.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, Online. Association for Computational Linguistics.

D. Liu and T. Miller. 2020. Federated pretraining and fine tuning of bert using clinical notes from multiple silos. ArXiv.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.

Lingjuan Lyu, Han Yu, Xingjun Ma, Lichao Sun, Jun Zhao, Qiang Yang, and Philip S Yu. 2020. Privacy and robustness in federated learning: Attacks and defenses. ArXiv preprint.

Yanjun Ma, Dianhai Yu, Tian Wu, and Haifeng Wang. 2019. Paddlepaddle: An open-source deep learning platform from industrial practice. Frontiers of Data and Computing, (1).

Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. 2017a. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, 20-22 April 2017, Fort Lauderdale, FL, USA, volume 54 of Proceedings of Machine Learning Research, pages 1273–1282. PMLR.

Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. 2017b. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, 20-22 April 2017, Fort Lauderdale, FL, USA, volume 54 of Proceedings of Machine Learning Research, pages 1273–1282. PMLR.

NVIDIA. 2019. Nvidia clara.

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Hwee Tou Ng, Anders Björkelund, Olga Uryupina, Yuchen Zhang, and Zhi Zhong. 2013. Towards robust linguistic analysis using OntoNotes. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 143–152, Sofia, Bulgaria. Association for Computational Linguistics.

Saurav Prakash and Amir Salman Avestimehr. 2020. Mitigating byzantine attacks in federated learning. ArXiv preprint.

Saurav Prakash, Sagar Dhakal, Mustafa Riza Akdeniz, Yair Yona, Shilpa Talwar, Salman Avestimehr, and Nageen Himayat. 2020. Coded computing for low-latency federated learning over wireless edge networks. IEEE Journal on Selected Areas in Communications, (1).
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, (140).

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Swaroop Indra Ramaswamy, Rajiv Mathews, K. Rao, and Françoise Beaufays. 2019. Federated learning for emoji prediction in a mobile keyboard. ArXiv.

Sashank J. Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný, Sanjiv Kumar, and Hugh Brendan McMahan. 2021. Adaptive federated optimization. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.

General Data Protection Regulation. 2016. Regulation eu 2016/679 of the european parliament and of the council of 27 april 2016. Official Journal of the European Union. Available at: http://ec.europa.eu/justice/data-protection/reform/files/regulation_oj_en.pdf (accessed 20 September 2017).

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.

Theo Ryffel, Andrew Trask, Morten Dahl, Bobby Wagner, Jason Mancuso, Daniel Rueckert, and Jonathan Passerat-Palmbach. 2018. A generic framework for privacy preserving deep learning. ArXiv preprint.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. ArXiv.

Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet S. Talwalkar. 2017. Federated multi-task learning. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 4424–4434.

Jinhyun So, Başak Güler, and A Salman Avestimehr. 2020. Byzantine-resilient secure federated learning. IEEE Journal on Selected Areas in Communications.

Jinhyun So, Başak Güler, and A Salman Avestimehr. 2021a. Codedprivateml: A fast and privacy-preserving framework for distributed machine learning. IEEE Journal on Selected Areas in Information Theory, (1).

Jinhyun So, Başak Güler, and A Salman Avestimehr. 2021b. Turbo-aggregate: Breaking the quadratic aggregation barrier in secure federated learning. IEEE Journal on Selected Areas in Information Theory, (1).

Joel Stremmel and Arjun Singh. 2020. Pretraining federated text models for next word prediction. ArXiv.

Dianbo Sui, Yubo Chen, Jun Zhao, Yantao Jia, Yuantao Xie, and Weijian Sun. 2020. FedED: Federated learning via ensemble distillation for medical relation extraction. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2118–2128, Online. Association for Computational Linguistics.

Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2017. NewsQA: A machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 191–200, Vancouver, Canada. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008.

Hongyi Wang, Kartik Sreenivasan, Shashank Rajput, Harit Vishwakarma, Saurabh Agarwal, Jy-yong Sohn, Kangwook Lee, and Dimitris S. Papailiopoulos. 2020a. Attack of the tails: Yes, you really can backdoor federated learning. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Hongyi Wang, Mikhail Yurochkin, Yuekai Sun, Dimitris S. Papailiopoulos, and Yasaman Khazaeni. 2020b. Federated learning with matched averaging. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. 2019. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST), (2).

T. Yang, G. Andrew, Hubert Eichner, Haicheng Sun, W. Li, Nicholas Kong, D. Ramage, and F. Beaufays. 2018a. Applied federated learning: Improving google keyboard query suggestions. ArXiv.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018b. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.

Ligeng Zhu, Zhijian Liu, and Song Han. 2019. Deep leakage from gradients. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 14747–14756.
to improve model accuracy or fairness is a very promising direction. In addition, it is also an interesting problem to adapt a heterogeneous model architecture for each client in the FL network. We show that it is feasible to fine-tune only a small portion of the parameters of LMs, so it is promising to adapt recent prefix-tuning methods (Li and Liang, 2021) for personalizing the parameters of NLP models within the FedNLP framework.
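One way such personalization could be wired into an FL loop is to keep the prefix parameters client-local and exchange only the shared backbone weights. The sketch below is a generic illustration under that assumption; the parameter-naming convention and the split policy are not part of FedNLP's current API.

import torch

def split_personal_and_shared(model, personal_keyword="prefix"):
    # Partition a model's state_dict into a client-local (personalized) part
    # and a globally shared part, so only the shared part is sent to the server.
    personal, shared = {}, {}
    for name, tensor in model.state_dict().items():
        (personal if personal_keyword in name else shared)[name] = tensor
    return personal, shared

def load_global_update(model, shared_update):
    # Merge the server's aggregated shared weights, keeping local prefixes untouched.
    state = model.state_dict()
    state.update(shared_update)
    model.load_state_dict(state)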
F The System Design of FedNLP

The FedNLP platform consists of three layers: the application layer, the algorithm layer, and the infrastructure layer. At the application layer, FedNLP provides three modules: data management, model definition, and a single-process trainer for all task formats; at the algorithm layer, FedNLP supports various FL algorithms; at the infrastructure layer, FedNLP aims at integrating single-process trainers with a distributed learning system for FL. Specifically, we make each layer and module perform its own duties, with a high degree of modularization.

F.1 Overall Workflow

The module calling logic of the whole framework is shown on the left of Figure 9. When we start federated training, we first complete the launcher script, device allocation, data loading, and model creation, and finally call the API of the federated learning algorithm. This process is expressed in Python-style code (see Alg. 2):

# create model by specifying the task
client_model, ... = create_model(model_args,
    formulation="classification")
# define a customized NLP Trainer
client_trainer = TCTrainer(device,
    client_model, ...)
# launch the federated training (e.g., FedAvg)
FedAvg_distributed(..., device,
    client_model,
    data_dict, ...,
    client_trainer)

Figure 9: The overall workflow and system design of the proposed FedNLP platform.

F.2 The Application Layer

Data Management. In data management, the DataManager controls the whole workflow from loading data to returning trainable features. To be specific, DataManager is set up for reading h5py data files and driving a preprocessor to convert raw data into features. There are four types of DataManager, according to the task definition. Users can customize their DataManager by inheriting one of the DataManager classes, specifying data operation functions, and embedding a particular preprocessor. Note that the raw data's H5Py file and the non-IID partition file are preprocessed offline, while the DataManager only loads them at runtime.

Model Definition. We support two types of models: Transformer and LSTM. For Transformer models, to integrate with the existing NLP ecosystem, our framework is compatible with the HuggingFace Transformers library (Wolf et al., 2020), so that various types of Transformers can be directly reused without the need for re-implementation. Specifically, our code is compatible with the three main classes of Tokenizer, Model, and Config in HuggingFace. Users can also customize them based on HuggingFace's code. Although LSTM has gradually deviated from the mainstream, we still support LSTM to reflect the framework's integrity, which may meet some particular use cases in a federated setting.
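For instance, a classification model can be assembled directly from the three HuggingFace classes mentioned above. The snippet below is a generic sketch, not the internals of create_model(); the checkpoint name and label count are placeholders.

from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased"
config = AutoConfig.from_pretrained(model_name, num_labels=20)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, config=config)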
NLP Trainer (single-process perspective). As for the task-specific NLP Trainer, the most prominent feature is that it does not require users to have any background in distributed computing. Users of FedNLP only need to write single-process code. A user should inherit the Trainer class in the application layer and implement the four methods shown in the figure: 1. the get_model_params() interface allows the algorithm layer to obtain the model parameters and transmit them to the server; 2. the set_model_params() interface obtains the updated model from the server's aggregation and then updates the parameters of the local model; 3. the train() and test() functions only need to consider the data of a single user, meaning that the trainer is completely consistent with centralized training.
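A task trainer implementing these four methods might therefore look roughly like the sketch below. It is schematic: in FedNLP one would subclass the framework's Trainer base class, and the hyperparameter attributes on args are assumptions.

import torch

class MyTaskTrainer:  # in FedNLP, this would subclass the application-layer Trainer class
    def __init__(self, model):
        self.model = model

    def get_model_params(self):
        # hand the current local weights to the algorithm layer / server
        return self.model.cpu().state_dict()

    def set_model_params(self, model_parameters):
        # load the aggregated global weights received from the server
        self.model.load_state_dict(model_parameters)

    def train(self, train_data, device, args):
        # a plain single-process training loop, identical to centralized code
        self.model.to(device).train()
        optimizer = torch.optim.AdamW(self.model.parameters(), lr=args.learning_rate)
        for batch in train_data:
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = self.model(**batch).loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    def test(self, test_data, device, args):
        # single-process evaluation; return whatever metric the task uses
        ...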
F.3 The Algorithm Layer

In the design of the algorithm layer, we follow the principle of a one-line API. The parameters of the API include the model, the data, and a single-process trainer (as shown in Algorithm 2). The algorithms we support include:

Centralized Training. We concatenate all client datasets and use the global data D_G to train a global model, i.e., the conventional protocol for learning an NLP model on a dataset.

FedAvg (McMahan et al., 2017a) is the de facto method for federated learning, assuming both clients and server use the SGD optimizer for updating model weights.

FedProx (Li et al., 2020b) tackles statistical heterogeneity by restricting the local model updates to be closer to the initial (global) model with L2 regularization, for better stability in training.

FedOPT (Reddi et al., 2021) is a generalized version of FedAvg. There are two gradient-based optimizers in the algorithm: ClientOpt and ServerOpt (please refer to the pseudo-code in the original paper (Reddi et al., 2021)). While ClientOpt is used to update the local models, ServerOpt treats the negative of the aggregated local changes, −Δ(t), as a pseudo-gradient and applies it to the global model. In our FedNLP framework, by default, we set ClientOpt to AdamW (Loshchilov and Hutter, 2019) and ServerOpt to SGD with momentum 0.9, and fix the server learning rate at 1.0.

LightSecAgg (??). The main idea of LightSecAgg is that each user protects its local model using a locally generated random mask. This mask is then encoded and shared with other users, in such a way that the aggregate mask of any sufficiently large set of surviving users can be directly reconstructed at the server. Our main effort in FedNLP is integrating these algorithms, optimizing their system performance, and designing user-friendly APIs to make them compatible with NLP models and FL algorithms.

Each algorithm includes two core objects, ServerManager and ClientManager, which integrate the communication module ComManager from the infrastructure layer and the Trainer of the training engine to complete the distributed algorithm protocol and edge training. Note that users can customize the Trainer by passing a customized Trainer through the algorithm API.
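The server-side pseudo-gradient step described for FedOPT can be sketched as follows. This is a simplified, framework-agnostic illustration, not the FedNLP implementation; it assumes every client reports the change of each named parameter.

import torch

def fedopt_server_update(global_model, client_deltas, client_weights, server_optimizer):
    # client_deltas: list of dicts {name: w_local_new - w_global}, one per client
    # client_weights: relative aggregation weights (e.g., proportional to local data size)
    # server_optimizer: e.g., torch.optim.SGD(global_model.parameters(), lr=1.0, momentum=0.9)

    # weighted average of local changes: Delta^(t)
    agg = {name: sum(w * d[name] for w, d in zip(client_weights, client_deltas))
           for name in client_deltas[0]}
    server_optimizer.zero_grad()
    for name, param in global_model.named_parameters():
        # ServerOpt treats -Delta^(t) as the gradient of the global model
        param.grad = -agg[name].to(param.device)
    server_optimizer.step()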
F.4 The Infrastructure Layer

The infrastructure layer includes three modules:
1) Users can write distributed scripts to manage GPU resource allocation. In particular, FedNLP provides a GPU assignment API (map_process_to_gpu() in Algorithm 2) to assign specific GPUs to different FL clients.
2) The algorithm layer can use a unified and abstract ComManager to complete a complex algorithmic communication protocol. Currently, we support the MPI (Message Passing Interface), RPC (Remote Procedure Call), and MQTT (Message Queuing Telemetry Transport) communication backends. MPI meets the distributed training needs within a single cluster; RPC meets the communication needs across data centers (e.g., cross-silo federated learning); MQTT can meet the communication needs of smartphones or IoT devices.
3) The third part is the training engine, which reuses existing deep learning training engines by exposing them through the Trainer class. Our current version of this module is built on PyTorch, but it can easily support frameworks such as TensorFlow. In the future, we may consider supporting a lightweight edge training engine optimized by compiler technology at this level.