
CS 440/ECE 448
Fall 2020 Vector Semantics 4
Margaret Fleck

The word2vec (aka skip-gram) algorithm is a newer approach that does normalization and
dimensionality reduction in one step. That is, it learns how to embed words into an
n-dimensional feature space. The algorithm was introduced in a couple of papers by Mikolov
et al. in 2013. However, it's pretty much impossible to understand the algorithm from those
papers, so the following derivation is from later papers by Yoav Goldberg and Omer Levy.
Two similar embedding algorithms (which we won't cover) are the CBOW algorithm from
word2vec and the GloVe algorithm from Stanford.

Overview of word2vec
In broad outline, word2vec has three steps:

Gather pairs (w,c), where w is our focus word and c is a nearby context word.
Randomly embed all words into R^k.
Iteratively adjust the embedding so that pairs (w,c) end up with similar coordinates.

A pair (w,c) is considered "similar" if the dot product w ⋅ c is large.

For example, our initial random embedding might put "elephant" near "matchbox" and far
from "rhino." Our iterative adjustment would gradually move the embedding for "rhino"
nearer to the embedding for "elephant."

There are two obvious parameters to select. The dimension of the embedding space would
depend on how detailed a representation the end user wants. The size of the context window
would be tuned for good performance.
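
Here is a minimal sketch of the pair-gathering step, assuming a simple symmetric context
window; the window size and the tiny example corpus are made up for illustration:

    def context_pairs(tokens, window=2):
        """Collect (focus, context) pairs from a list of tokens,
        using a symmetric window of the given size."""
        pairs = []
        for i, w in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((w, tokens[j]))
        return pairs

    # context_pairs("the quick brown fox".split())
    # -> [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ...]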

Negative sampling
Problem: a great solution for the above optimization problem is to map all words to the
same embedding vector. That defeats the purpose of having word embeddings.

Suppose we had negative examples (w,c'), in which c' is not a likely context word for w. Then
we could make the optimization move w and c' away from each other. Revised problem: our
data doesn't provide negative examples.

Solution: Train against random noise, i.e. randomly generate "bad" context words for each
focus word.

Called "negative sampling" (not a very evocative term)

For a fixed focus word w, negative context words are picked with a probability based
on how often words occur in the training data.
Depends on good example pairs being relatively sparse in the space of all word pairs
(probably correct)

The negative training data will be corrupted by containing some good examples, but this
corruption should be a small percentage of the negative training data.
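
Here is a minimal sketch of the negative-sampling step, assuming negatives are drawn with
probability proportional to word frequency in the training data (the released word2vec code
actually raises the counts to the 0.75 power, a detail these notes skip):

    import random
    from collections import Counter

    def negative_samples(tokens, k=5):
        """For a given focus word, draw k "bad" context words, weighted by
        how often each word occurs in the training data. Some draws may
        accidentally be real context words, but that corruption should be
        a small fraction of the negative data."""
        counts = Counter(tokens)
        words = list(counts.keys())
        weights = list(counts.values())
        return random.choices(words, weights=weights, k=k)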

Two embeddings
Another mathematical problem happens when we consider a word w with itself as context.
The dot product w ⋅ w is large, by definition. But, for most words, w is not a common
context for itself, e.g. "dog dog" is not very common compared to "dog." So setting up the
optimization process in the obvious way causes the algorithm to want to move w away from
itself, which it obviously cannot do.

To avoid this degeneracy, word2vec builds two embeddings of each word w, one for w seen
as a focus word and one for w used as a context word. The two embeddings are closely
connected, but not identical. The final representation for w will be a concatenation of the
two embedding vectors.
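
A minimal sketch of this two-embedding bookkeeping, assuming the vectors live in two
NumPy matrices indexed by a word-to-id map (the names and initialization scale are
illustrative choices, not from the notes):

    import numpy as np

    def init_embeddings(vocab, k=50, seed=0):
        """Give each word two random k-dimensional vectors: one for its
        role as a focus word, one for its role as a context word."""
        rng = np.random.default_rng(seed)
        word_to_id = {w: i for i, w in enumerate(vocab)}
        W_focus = rng.normal(scale=0.1, size=(len(vocab), k))
        W_context = rng.normal(scale=0.1, size=(len(vocab), k))
        return word_to_id, W_focus, W_context

    def final_representation(word, word_to_id, W_focus, W_context):
        """The final vector for a word is the concatenation of its
        focus embedding and its context embedding."""
        i = word_to_id[word]
        return np.concatenate([W_focus[i], W_context[i]])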

Details of algorithm
Our classifier tries to predict yes/no from pairs (w,c), where "yes" means c is a good context
word for w.

We'd like to make this into a probabilistic model. So we run the dot product through a
sigmoid to produce a "probability" that (w,c) is a good word-context pair. These numbers
probably aren't the actual probabilities, but we're about to treat them as if they are. That
is, we approximate

P(good | w, c) ≈ σ(w ⋅ c).


By the magic of exponentials (see below), this means that the probability that (w,c) is not a
good word-context pair is

P(bad | w, c) ≈ σ(−w ⋅ c).


Now we switch to log probabilities, so we can use addition rather than multiplication. So
we'll be looking at e.g.

log(P(good | w, c)) ≈ log(σ(w ⋅ c))


Suppose that D is the set of positive training pairs and D′ is the set of negative training
pairs. So our iterative refinement algorithm adjusts the embeddings (both context and word)
so as to maximize

∑_{(w,c)∈D} log(σ(w ⋅ c)) + ∑_{(w,c)∈D′} log(σ(−w ⋅ c))

That is, each time we read a new focus word w from the training data, we

find its context words,
generate some negative context words, and
adjust the embeddings of w and the context words in the direction that increases the
above sum.
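
Here is a minimal sketch of one such update, assuming plain stochastic gradient ascent on
the sum above and the two embedding matrices from the earlier sketch (the learning rate is
an illustrative choice, not from the notes):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sgd_step(w_id, pos_ids, neg_ids, W_focus, W_context, lr=0.025):
        """One gradient-ascent step for a single focus word w, given the ids
        of its real context words (pos_ids) and of the randomly generated
        negative context words (neg_ids)."""
        w = W_focus[w_id]
        grad_w = np.zeros_like(w)
        pairs = [(c, 1.0) for c in pos_ids] + [(c, 0.0) for c in neg_ids]
        for c_id, label in pairs:
            c = W_context[c_id]
            # For good pairs the objective term is log(sigmoid(w.c)); for bad
            # pairs it is log(sigmoid(-w.c)). Both gradients collapse to
            # (label - sigmoid(w.c)) times the other vector.
            g = label - sigmoid(np.dot(w, c))
            grad_w += g * c
            W_context[c_id] += lr * g * w   # move context vector toward/away from w
        W_focus[w_id] += lr * grad_w        # move the focus vector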

Unexplained exponentials
Ok, so how did we get from P(good | w, c) = σ(w ⋅ c) to P(bad | w, c) = σ(−w ⋅ c)?
"Good" and "bad" are supposed to be opposites. So P(bad | w, c) = σ(−w ⋅ c) should be
equal to 1 − P(good | w, c) = 1 − σ(w ⋅ c). I claim that 1 − σ(w ⋅ c) = σ(−w ⋅ c).

This claim actually has nothing to do with the dot products. As a general thing
1 − σ(x) = σ(−x). Here's the math.
Recall the definition of the sigmoid function: σ(x) = 1/(1 + e^(−x)).

1 − σ(x) = 1 − 1/(1 + e^(−x))
         = e^(−x)/(1 + e^(−x))      (add the fractions)
         = 1/(1/e^(−x) + 1)         (divide top and bottom by e^(−x))
         = 1/(e^x + 1)              (what does a negative exponent mean?)
         = 1/(1 + e^x)
         = σ(−x)
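
A quick numeric check of the identity (purely illustrative):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    x = np.linspace(-5, 5, 11)
    # 1 - sigmoid(x) and sigmoid(-x) agree up to floating-point error.
    assert np.allclose(1 - sigmoid(x), sigmoid(-x))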
