Going Beyond T-SNE: Exposing Whatlies in Text Embeddings
Abstract

We introduce whatlies, an open source toolkit for visually inspecting word and sentence embeddings.
# plot every embedding in `emb` along the axes spanned by the
# vectors for 'man' and 'yellow'
emb.plot_interactive(x_axis="man",
                     y_axis="yellow",
                     show_axis_point=True)
from whatlies.transformers import Pca, Umap

# reduce the embeddings to two dimensions with PCA ...
p1 = (emb
      .transform(Pca(2))
      .plot_interactive("pca_0",
                        "pca_1"))
# ... and, for comparison, with UMAP
p2 = (emb
      .transform(Umap(2))
      .plot_interactive("umap_0",
                        "umap_1"))
# the plots are Altair charts, so `|` renders them side by side
p1 | p2
These plots are fully interactive. It is possible to click and drag in order to navigate through the embedding space and to zoom in and out. These plots can be hosted on a website but they can also be exported to png/svg for publication. It is furthermore possible to apply any vector arithmetic operations for these plots, resulting in Figure 4.
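The snippet below is a sketch of such arithmetic axes rather than the exact code behind Figure 4; it assumes the lang backend used throughout the paper and that plot_interactive accepts embedding expressions as axes (whatlies supports arithmetic on Embedding objects).

# plot every embedding against two analogy-style axes, themselves
# the result of vector arithmetic
emb.plot_interactive(x_axis=lang['man'] - lang['woman'],
                     y_axis=lang['king'] - lang['queen'],
                     show_axis_point=True)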
Figure 5: Demonstration of PCA and UMAP transformations.

Transformations in whatlies are slightly different from, for example, scikit-learn transformations: in addition to dimensionality reduction, a transformation can also add embeddings that represent each principal component to the EmbeddingSet object. As a result, these components can be referred to as axes for creating visualisations, as seen in Figure 5.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from whatlies.language import BytePairLanguage

# a whatlies language backend can act as the featurisation step
# of a scikit-learn pipeline
pipe = Pipeline([
    ("embed", BytePairLanguage("en")),
    ("model", LogisticRegression())
])

X = [
    "i really like this post",
    "thanks for that comment",
    "i enjoy this friendly forum",
    "this is a bad post",
    "i dislike this article",
    "this is not well written"
]
y = np.array([1, 1, 1, 0, 0, 0])

pipe.fit(X, y).predict(X)

This feature enables fast exploration of many different word embedding algorithms.
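Such a pipeline also composes with the rest of the scikit-learn ecosystem. The following is a sketch rather than code from the paper; it assumes that whatlies' SpacyLanguage implements the same scikit-learn transformer API as BytePairLanguage does above.

from sklearn.model_selection import GridSearchCV
from whatlies.language import SpacyLanguage

# swap the entire "embed" step between two embedding backends
# and compare them on the same task
grid = GridSearchCV(
    pipe,
    param_grid={"embed": [BytePairLanguage("en"),
                          SpacyLanguage("en_core_web_sm")]},
    cv=2,
)
grid.fit(X, y)
print(grid.cv_results_["mean_test_score"])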
Visualising Bias. One use-case of whatlies is to gain insight into bias-related issues in an embedding space. Because the library readily supports vector arithmetic, it is possible to create an EmbeddingSet holding pairs of representations, as sketched below.
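The paper's exact pair-construction snippet is not reproduced here; the following is a minimal sketch, with the word pairs inferred from the caption of Figure 6 and lang assumed to be the backend loaded earlier.

# difference vectors for gendered word pairs; their pairwise
# distances can then be rendered as the distance map in Figure 6
emb_pairs = EmbeddingSet(
    (lang['she'] - lang['he']),
    (lang['woman'] - lang['man']),
    (lang['nurse'] - lang['physician']),
)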
Subsequently, the new EmbeddingSet can be visualised as a distance map as in Figure 6, revealing a number of spurious correlations that suggest a gender bias in the embedding space.

Figure 6: Distance map for visualising bias. If there were no bias we would expect 'she-he' to have a distance near 1.0 compared to 'nurse-physician'. The figure shows this is not the case.

It is possible to apply the debiasing technique introduced by Bolukbasi et al. (2016) in order to approximately remove the direction corresponding to gender. The code snippet below achieves this by, again, using the arithmetic notation.

emb = lang[words]  # `words` is the word list defined earlier in the paper
axis = EmbeddingSet(
    (lang['man'] - lang['woman']),
    (lang['king'] - lang['queen']),
    (lang['father'] - lang['mother'])
).average()
emb_debias = emb | axis
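For intuition, the | operator above performs vector rejection: it removes from each embedding its component along the averaged gender axis. A plain-numpy sketch of that idea (not the whatlies internals):

import numpy as np

def reject(v, axis):
    # subtract the projection of `v` onto `axis`, leaving the part
    # of `v` that is orthogonal to the (gender) direction
    unit = axis / np.linalg.norm(axis)
    return v - np.dot(v, unit) * unit

print(reject(np.array([3.0, 1.0]), np.array([1.0, 0.0])))  # -> [0. 1.]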
For contextualised embeddings we are working with a notation that uses square brackets to select an embedding from the context of the sentence that it resides in:

mod_name = "en_trf_robertabase_lg"
lang = SpacyLanguage(mod_name)

# the two mentions of 'bank' receive different,
# context-dependent vectors
emb1 = lang['[bank] of the river']
emb2 = lang['money on the [bank]']

# comparing numpy arrays element-wise yields an array, so check
# that at least one component differs
assert (emb1.vector != emb2.vector).any()

Ideally we also introduce the necessary notation for retrieving the contextualised embedding from a particular layer, e.g. lang['bank'][2] for obtaining the representation of bank from the second layer of the given language model.
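That layer-selection notation is a plan rather than an existing feature. For intuition only, here is a rough sketch of what retrieving a layer-specific contextualised vector involves with the transformers library (Wolf et al., 2019); the model choice and token handling are illustrative assumptions, not whatlies code.

import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base",
                                  output_hidden_states=True)

enc = tok("money on the bank", return_tensors="pt")
with torch.no_grad():
    out = model(**enc)

# hidden_states[0] is the embedding layer; index 2 is the output
# of the second transformer layer
layer_two = out.hidden_states[2]
# locate 'bank' (RoBERTa's BPE marks a leading space with 'Ġ')
idx = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids("Ġbank"))
vector = layer_two[0, idx]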
Conclusion

We have introduced whatlies, a Python library for inspecting word and sentence embeddings that is very flexible because it offers a programmable interface. We currently support a variety of embedding models, including fastText, spaCy, BERT, and ConveRT. This paper has showcased its current use as well as plans for future development. The project is hosted at https://github.com/RasaHQ/whatlies and we are happy to receive community contributions that extend and improve the package.

Acknowledgements

Despite being only a few months old, the project has started getting traction on GitHub and has attracted the help of outside contributions. In particular we would like to thank Masoud Kazemi for many contributions to the project. We would furthermore like to thank Adam Lopez for many rounds of discussion that considerably improved the paper.

References

Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pages 4356–4364, USA. Curran Associates Inc.

Marc-Etienne Brunet, Colleen Alkalay-Houlihan, Ashton Anderson, and Richard Zemel. 2019. Understanding the origins of bias in word embeddings. In ICML.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and James Zou. 2018. Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16):E3635–E3644.

Hila Gonen and Yoav Goldberg. 2019. Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 609–614, Minneapolis, Minnesota. Association for Computational Linguistics.

Benjamin Heinzerling and Michael Strube. 2018. BPEmb: Tokenization-free pre-trained subword embeddings in 275 languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Matthew Henderson, Iñigo Casanueva, Nikola Mrkšić, Pei-Hao Su, Tsung-Hsien Wen, and Ivan Vulić. 2019. ConveRT: Efficient and accurate conversational representations from transformers. ArXiv, abs/1911.03688.

Omer Levy and Yoav Goldberg. 2014. Linguistic regularities in sparse and explicit word representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 171–180, Ann Arbor, Michigan. Association for Computational Linguistics.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013a. Distributed representations of words and phrases and their compositionality. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013b. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia. Association for Computational Linguistics.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237. Association for Computational Linguistics.

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Andrew Trask, Phil Michalak, and John Liu. 2015. sense2vec - a fast and accurate method for word sense disambiguation in neural word embeddings. ArXiv, abs/1511.06388.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.