Going Beyond T-SNE: Exposing Whatlies in Text Embeddings
Abstract

We introduce whatlies, an open source toolkit for visually inspecting word and sentence embeddings.
# plot every embedding in `emb` along the axes spanned by the
# vectors for 'man' and 'yellow'
emb.plot_interactive(x_axis="man",
                     y_axis="yellow",
                     show_axis_point=True)
from whatlies.transformers import Pca, Umap

# reduce the embeddings to two dimensions with PCA ...
p1 = (emb
      .transform(Pca(2))
      .plot_interactive("pca_0",
                        "pca_1"))
# ... and, for comparison, with UMAP
p2 = (emb
      .transform(Umap(2))
      .plot_interactive("umap_0",
                        "umap_1"))
# the plots are Altair charts, so `|` renders them side by side
p1 | p2
These plots are fully interactive. It is possible to click and drag in order to navigate through the embedding space and to zoom in and out. These plots can be hosted on a website but they can also be exported to png/svg for publication. It is furthermore possible to apply any vector arithmetic operations for these plots, resulting in Figure 4.
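The snippet below is a sketch of such arithmetic axes rather than the exact code behind Figure 4; it assumes the lang backend used throughout the paper and that plot_interactive accepts embedding expressions as axes (whatlies supports arithmetic on Embedding objects).

# plot every embedding against two analogy-style axes, themselves
# the result of vector arithmetic
emb.plot_interactive(x_axis=lang['man'] - lang['woman'],
                     y_axis=lang['king'] - lang['queen'],
                     show_axis_point=True)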
Figure 5: Demonstration of PCA and UMAP transformations.

Transformations in whatlies are slightly different from, for example, scikit-learn transformations: in addition to dimensionality reduction, a transformation can also add embeddings that represent each principal component to the EmbeddingSet object. As a result, these components can be referred to as axes for creating visualisations, as seen in Figure 5.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from whatlies.language import BytePairLanguage

# a whatlies language backend can act as the featurisation step
# of a scikit-learn pipeline
pipe = Pipeline([
    ("embed", BytePairLanguage("en")),
    ("model", LogisticRegression())
])

X = [
    "i really like this post",
    "thanks for that comment",
    "i enjoy this friendly forum",
    "this is a bad post",
    "i dislike this article",
    "this is not well written"
]
y = np.array([1, 1, 1, 0, 0, 0])

pipe.fit(X, y).predict(X)

This feature enables fast exploration of many different word embedding algorithms.
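Such a pipeline also composes with the rest of the scikit-learn ecosystem. The following is a sketch rather than code from the paper; it assumes that whatlies' SpacyLanguage implements the same scikit-learn transformer API as BytePairLanguage does above.

from sklearn.model_selection import GridSearchCV
from whatlies.language import SpacyLanguage

# swap the entire "embed" step between two embedding backends
# and compare them on the same task
grid = GridSearchCV(
    pipe,
    param_grid={"embed": [BytePairLanguage("en"),
                          SpacyLanguage("en_core_web_sm")]},
    cv=2,
)
grid.fit(X, y)
print(grid.cv_results_["mean_test_score"])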
Visualising Bias. One use-case of whatlies is to gain insight into bias-related issues in an embedding space. Because the library readily supports vector arithmetic, it is possible to create an EmbeddingSet holding pairs of representations, as sketched below.
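The paper's exact pair-construction snippet is not reproduced here; the following is a minimal sketch, with the word pairs inferred from the caption of Figure 6 and lang assumed to be the backend loaded earlier.

# difference vectors for gendered word pairs; their pairwise
# distances can then be rendered as the distance map in Figure 6
emb_pairs = EmbeddingSet(
    (lang['she'] - lang['he']),
    (lang['woman'] - lang['man']),
    (lang['nurse'] - lang['physician']),
)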
Subsequently, the new EmbeddingSet can be visualised as a distance map as in Figure 6, revealing a number of spurious correlations that suggest a gender bias in the embedding space.

Figure 6: Distance map for visualising bias. If there were no bias we would expect 'she-he' to have a distance near 1.0 compared to 'nurse-physician'. The figure shows this is not the case.

It is possible to apply the debiasing technique introduced by Bolukbasi et al. (2016) in order to approximately remove the direction corresponding to gender. The code snippet below achieves this by, again, using the arithmetic notation.

emb = lang[words]  # `words` is the word list defined earlier in the paper
axis = EmbeddingSet(
    (lang['man'] - lang['woman']),
    (lang['king'] - lang['queen']),
    (lang['father'] - lang['mother'])
).average()
emb_debias = emb | axis
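For intuition, the | operator above performs vector rejection: it removes from each embedding its component along the averaged gender axis. A plain-numpy sketch of that idea (not the whatlies internals):

import numpy as np

def reject(v, axis):
    # subtract the projection of `v` onto `axis`, leaving the part
    # of `v` that is orthogonal to the (gender) direction
    unit = axis / np.linalg.norm(axis)
    return v - np.dot(v, unit) * unit

print(reject(np.array([3.0, 1.0]), np.array([1.0, 0.0])))  # -> [0. 1.]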
For contextualised embeddings we are working with a notation that uses square brackets to select an embedding from the context of the sentence that it resides in:

mod_name = "en_trf_robertabase_lg"
lang = SpacyLanguage(mod_name)

# the two mentions of 'bank' receive different,
# context-dependent vectors
emb1 = lang['[bank] of the river']
emb2 = lang['money on the [bank]']

# comparing numpy arrays element-wise yields an array, so check
# that at least one component differs
assert (emb1.vector != emb2.vector).any()

Ideally we also introduce the necessary notation for retrieving the contextualised embedding from a particular layer, e.g. lang['bank'][2] for obtaining the representation of bank from the second layer of the given language model.
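That layer-selection notation is a plan rather than an existing feature. For intuition only, here is a rough sketch of what retrieving a layer-specific contextualised vector involves with the transformers library (Wolf et al., 2019); the model choice and token handling are illustrative assumptions, not whatlies code.

import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base",
                                  output_hidden_states=True)

enc = tok("money on the bank", return_tensors="pt")
with torch.no_grad():
    out = model(**enc)

# hidden_states[0] is the embedding layer; index 2 is the output
# of the second transformer layer
layer_two = out.hidden_states[2]
# locate 'bank' (RoBERTa's BPE marks a leading space with 'Ġ')
idx = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids("Ġbank"))
vector = layer_two[0, idx]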
Conclusion

We have introduced whatlies, a Python library for inspecting word and sentence embeddings that is very flexible because it offers a programmable interface. We currently support a variety of embedding models, including fastText, spaCy, BERT, and ConveRT. This paper has showcased its current use as well as plans for future development. The project is hosted at https://github.com/RasaHQ/whatlies and we are happy to receive community contributions that extend and improve the package.

Acknowledgements

Despite being only a few months old, the project has started getting traction on GitHub and has attracted the help of outside contributions. In particular we would like to thank Masoud Kazemi for many contributions to the project. We would furthermore like to thank Adam Lopez for many rounds of discussion that considerably improved the paper.

References

Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pages 4356–4364, USA. Curran Associates Inc.

Marc-Etienne Brunet, Colleen Alkalay-Houlihan, Ashton Anderson, and Richard Zemel. 2019. Understanding the origins of bias in word embeddings. In ICML.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and James Zou. 2018. Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16):E3635–E3644.

Hila Gonen and Yoav Goldberg. 2019. Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 609–614, Minneapolis, Minnesota. Association for Computational Linguistics.

Benjamin Heinzerling and Michael Strube. 2018. BPEmb: Tokenization-free pre-trained subword embeddings in 275 languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Matthew Henderson, Iñigo Casanueva, Nikola Mrkšić, Pei-Hao Su, Tsung-Hsien Wen, and Ivan Vulić. 2019. ConveRT: Efficient and accurate conversational representations from transformers. ArXiv, abs/1911.03688.

Omer Levy and Yoav Goldberg. 2014. Linguistic regularities in sparse and explicit word representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 171–180, Ann Arbor, Michigan. Association for Computational Linguistics.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013a. Distributed representations of words and phrases and their compositionality. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013b. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia. Association for Computational Linguistics.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237. Association for Computational Linguistics.

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Andrew Trask, Phil Michalak, and John Liu. 2015. sense2vec - a fast and accurate method for word sense disambiguation in neural word embeddings. ArXiv, abs/1511.06388.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.