NLP-NeuralNetworks Reading Notes
December 20, 2017
f(x) = x · W + b
x ∈ R^(d_in),  W ∈ R^(d_in × d_out),  b ∈ R^(d_out)
Here, the vector x is the input to the function, while the matrix W and the vector b are the parameters.
• Linear models have several desirable properties: they are easy and efficient to train, they often result in convex optimization objectives, the trained models are somewhat interpretable, and they are often very effective in practice.
Linear Models
• Binary Classification: Single output, d_out = 1, so w is a vector and b is a scalar:
f(x) = x · w + b
• Sign Function: Maps negative values to −1 (negative class) and non-negative values to +1 (positive class); see the sketch at the end of this section.
• A dataset is (binary) linearly separable: the two classes can be separated by a straight line
• Nonlinearly separable: Not possible to separate the data-points using a straight line. Beyond the hypothesis
class of linear classifiers.
• Feature: A property by which we classify the data points
• Feature Extraction: Function that maps a real world object to a vector of measurable quantities which can
be used as inputs to the model
• Feature Engineering: The design of the feature functions
• Deep learning promises to simplify the feature-engineering process by allowing the model designer to specify
a small set of core, basic, or ”natural” features, and letting the trainable neural network architecture combine
them into more meaningful higher-level features, or representations
• Sigmoid Function: Maps values from (−∞, ∞) to the range [0, 1] by squashing the output:
σ(x) = 1 / (1 + e^(−x))
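A minimal NumPy sketch of the binary linear classifier above; the input dimension, the random weights, and the helper names are illustrative assumptions, not values from the notes:

    import numpy as np

    d_in = 4                          # illustrative input dimension
    rng = np.random.default_rng(0)
    w = rng.normal(size=d_in)         # weight vector (d_out = 1)
    b = 0.0                           # scalar bias

    def score(x):
        return x @ w + b              # linear score f(x) = x . w + b

    def sign_classify(x):
        return -1 if score(x) < 0 else +1    # sign function: -1 / +1

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))      # squash the score into [0, 1]

    x = rng.normal(size=d_in)         # a toy input
    print(sign_classify(x), sigmoid(score(x)))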
Representations
• Representation (of a document): A vector that captures the properties of the document that are important.
Training As Optimization
• loss function: function for quantifying the loss suffered when predicting ŷ while the true label is y.
– A loss function L(ŷ, y) assigns a numerical score (a scalar) to a predicted output ŷ given the true expected
output y.
– Bounded from below, with the minimum attained only for cases where prediction is correct.
• The parameters of the learned function (W and b) are set in order to minimize the loss L over the training
examples
• Regularization Term: A function R(Θ) that takes the parameters as input and returns a scalar that reflects their "complexity". Keeping this complexity low helps counter over-fitting.
• Hinge Loss: The loss is 0 when y and ŷ share the same sign and |ŷ| ≥ 1; otherwise, the loss is linear: L_hinge(ŷ, y) = max(0, 1 − y · ŷ).
• Hinge (Multi-class) Loss: Denote by t = argmax_i y[i] the correct class, and by k = argmax_{i≠t} ŷ[i] the highest scoring class such that k ≠ t. The multi-class hinge loss is defined as:
L_hinge(ŷ, y) = max(0, 1 − (ŷ[t] − ŷ[k]))
• In practice, the regularizers R equate complexity with large weights, and work to keep the parameter values
low.
• Regularizers R measure the norms of the parameter matrices, and drive the learner toward solutions with low
norms
• L2 Regularization: R takes the form of the squared L2 norm of the parameters, trying to keep the sum of the squares of the parameter values low:
R_L2(W) = ‖W‖₂² = Σ_{i,j} (W[i,j])²
• L1 Regularization: R takes the form of the L1 norm of the parameters, trying to keep the sum of the
absolute values of the parameters low.
• Elastic Net Regularization: Combines L1 and L2 regularization in a weighted sum (see the sketch at the end of this section).
• Dropout: Another form of regularization which is very effective in neural networks.
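A small NumPy sketch of the hinge losses and regularizers above; the function names and the elastic-net weights are my own illustrative choices:

    import numpy as np

    def multiclass_hinge_loss(y_hat, y):
        # t: correct class; k: highest-scoring class other than t
        t = int(np.argmax(y))
        masked = y_hat.copy()
        masked[t] = -np.inf
        k = int(np.argmax(masked))
        return max(0.0, 1.0 - (y_hat[t] - y_hat[k]))

    def l2_regularizer(W):
        return np.sum(W ** 2)         # sum of squared parameter values

    def l1_regularizer(W):
        return np.sum(np.abs(W))      # sum of absolute parameter values

    def elastic_net(W, lam1=0.5, lam2=0.5):
        # weighted combination of L1 and L2
        return lam1 * l1_regularizer(W) + lam2 * l2_regularizer(W)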
Gradient-Based Optimization
• Gradient-based methods work by repeatedly computing an estimate of the loss L over the training set, computing the gradients of the parameters Θ with respect to the loss estimate, and moving the parameters in the opposite direction of the gradient.
• Convex Function: A function whose second-derivative is always non-negative, and hence has a single
minimum point
• Concave Function: A function whose second-derivative is always non-positive, and hence has a single
maximum point
• For functions that are neither convex nor concave, a gradient-based optimization procedure may converge to a local extremum point, missing the global optimum.
• SGD: Stochastic Gradient Descent, a general optimization algorithm.
– The goal is to set the parameters Θ so as to minimize the total loss over the training set.
– Works by repeatedly sampling a training example and computing the gradient of the error on that example with respect to the parameters Θ.
– The loss is treated as a function of the parameters Θ.
– The parameters are updated in the opposite direction of the gradient, scaled by a learning rate η_t (see the sketch below).
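A schematic NumPy sketch of the SGD loop just described; grad_fn, the step count, and the fixed learning rate are placeholders I am assuming:

    import numpy as np

    def sgd(theta, examples, grad_fn, eta=0.1, steps=1000, seed=0):
        # theta: parameter vector; grad_fn(theta, x, y) returns the gradient
        # of the loss on a single example with respect to theta
        rng = np.random.default_rng(seed)
        for _ in range(steps):
            x, y = examples[rng.integers(len(examples))]   # sample one example
            g = grad_fn(theta, x, y)                       # gradient of its loss
            theta = theta - eta * g                        # move against the gradient
        return theta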
Kernel Methods
• SVM: (Kernelized) Support Vector Machines
• SVMs define a set of generic mappings of data into high-dimensional spaces, and then perform linear classification in the transformed space.
• Polynomial Mapping: φ(x) = x^d is an example of a mapping that gives all combinations of d variables. For d = 2, it can solve XOR.
• High-dimensional spaces can become computationally prohibitive and risk overfitting
• Kernel Trick: Allows one to work in the transformed space without ever computing the transformed representation. It makes the classification procedure for SVMs depend linearly on the size of the training set.
• An alternative to the kernel approach is to define a trainable nonlinear mapping φ and train it together with the linear classifier:
ŷ = φ(x)W + b
φ(x) = g(xW′ + b′)
• This is the main idea behind deep learning and neural networks, specifically multi-layer perceptrons (MLPs).
A Brain-Inspired Metaphor
• For every neuron, each input has an associated weight. The neuron multiplies each input by its weight, sums the results, applies a nonlinear function to the sum, and passes it on as its output.
• Neurons are arranged in layers, reflecting the flow of information.
• Bottom layer has no incoming arrows and is the input
• Top-most layer has no outgoing arrows and is the output
• The other layers of the network are "hidden layers".
• Fully Connected Layer / Affine Layer: Each neuron is connected to all of the neurons in the next layer.
• A feed-forward network is just a stack of linear models separated by nonlinear functions.
• The values of each row in the network can be thought of as a vector
• A fully connected layer implements a vector-matrix multiplication, h = xW where the weight of the connection
from the ith neuron in the input row to the jth neuron in the output row is W[i,j] . The values of h are then
transformed by a nonlinear function g that is applied to each value before being passed on as input to the
next layer.
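A NumPy sketch of the fully connected layers just described; the sizes, the random weights, and the choice of tanh as the nonlinearity g are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_hid, d_out = 8, 16, 3                   # illustrative sizes
    W1, b1 = rng.normal(size=(d_in, d_hid)), np.zeros(d_hid)
    W2, b2 = rng.normal(size=(d_hid, d_out)), np.zeros(d_out)

    def feed_forward(x):
        # fully connected layer: h = g(xW1 + b1), where W1[i, j] connects
        # input neuron i to hidden neuron j
        h = np.tanh(x @ W1 + b1)
        return h @ W2 + b2                          # output layer

    print(feed_forward(rng.normal(size=d_in)))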
In Mathematical Notation
• Perceptron: The simplest neural network; simply a linear model:
NN_Perceptron(x) = xW + b
x ∈ R^(d_in),  W ∈ R^(d_in × d_out),  b ∈ R^(d_out)
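For reference, a multi-layer perceptron with one hidden layer (MLP1), referred to below, adds a nonlinearity g between two linear transformations; this definition is not spelled out in the notes but follows the same notation:

NN_MLP1(x) = g(xW^1 + b^1)W^2 + b^2
x ∈ R^(d_in),  W^1 ∈ R^(d_in × d_1),  b^1 ∈ R^(d_1),  W^2 ∈ R^(d_1 × d_out),  b^2 ∈ R^(d_out)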
Representation Power
• Theoretically, one does not need to go beyond an MLP1 (a network with a single hidden layer, as defined above): it is a universal approximator. However, the required hidden dimensionality can be exponentially high, and learnability is poor.
Common Nonlinearities
• Sigmoid: Transforms each value x into the range [0, 1].
• Hyperbolic Tangent (tanh): Transforms the values x into the range [−1, 1].
• Hard tanh: Approximation of the tanh function which is faster to compute and to find derivatives thereof.
• Rectifier (ReLU): (Rectified Linear Unit) clips each value x < 0 at 0. Performs well when combined with dropout regularization.
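A NumPy sketch of the four nonlinearities listed above, applied elementwise; the function names are mine:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))    # values in (0, 1)

    def tanh(x):
        return np.tanh(x)                  # values in (-1, 1)

    def hard_tanh(x):
        return np.clip(x, -1.0, 1.0)       # fast piecewise-linear approximation of tanh

    def relu(x):
        return np.maximum(x, 0.0)          # clips each value x < 0 at 0

    x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(sigmoid(x), tanh(x), hard_tanh(x), relu(x))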
Loss Functions
• L(ŷ, y) assigns a numerical score (a scalar) to the network’s output ŷ given the true expected output y.
• A scalar score over a pair of vectors u and v can also be produced by a trainable bilinear form or a trainable distance:
sim_bilinear(u, v) = uMv
dist(u, v) = (u − v)M(u − v)
• A multi-layer perceptron with a single output neuron can also be used for producing a scalar from two vectors,
by feeding it the concatenation of the two vectors
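A NumPy sketch of the two ways of producing a scalar from two vectors described above; the trainable matrix M and the MLP weights are random placeholders of my own naming:

    import numpy as np

    rng = np.random.default_rng(0)
    d = 5
    u, v = rng.normal(size=d), rng.normal(size=d)

    M = rng.normal(size=(d, d))                     # trainable bilinear matrix
    sim_bilinear = u @ M @ v                        # scalar score u M v
    dist = (u - v) @ M @ (u - v)                    # trainable "distance"

    # MLP with a single output neuron over the concatenation [u; v]
    W1, b1 = rng.normal(size=(2 * d, d)), np.zeros(d)
    w2, b2 = rng.normal(size=d), 0.0
    mlp_score = np.tanh(np.concatenate([u, v]) @ W1 + b1) @ w2 + b2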
Embedding Layers
• Embedding Layer: Layer that performs the mapping from symbolic feature values to d-dimensional vectors
(also called a lookup layer).
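A sketch of an embedding (lookup) layer in NumPy; the toy vocabulary and the embedding dimension are illustrative assumptions:

    import numpy as np

    vocab = {"the": 0, "dog": 1, "barks": 2}    # symbolic feature -> index
    d = 4                                       # embedding dimension
    E = np.random.default_rng(0).normal(size=(len(vocab), d))

    def embed(words):
        # map symbolic features to d-dimensional vectors by row lookup in E
        return E[[vocab[w] for w in words]]

    print(embed(["the", "dog"]).shape)          # (2, 4)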
Practicalities
• The non-convexity of the objective function means the optimization procedure may get stuck in a local
minimum or a saddle point, and that starting from different initial points may result in different results.
• Random Restarts: Running the training process several times, each with a different random initialization,
and choosing the best one on the development set.
• Model Ensembles: Basing a prediction on an ensemble of several trained models, rather than on a single
one (done by taking majority vote or averaging output vectors).
• In deep networks, it is common for the error gradients to either vanish (become exceedingly close to 0) or
explode (become exceedingly high)
– Vanishing gradients are an open research question, but LSTMs and GRUs assist in gradient flow.
– To deal with exploding gradients, clip the gradients if their norm exceeds a given threshold (see the sketch at the end of this section).
• Saturated Neurons: Neurons with output values all close to their upper-limit and with very small gradients.
– Layers with tanh and sigmoid activations can become saturated with output values all close to 1
– Saturated neurons are caused by too-large values entering the layer.
– Controlled for by changing initialization, scaling the range of the input values, or changing learning rate
– Layer normalization (normalizing output vector) will help but is computationally expensive
• Dead Neurons: Neurons for which most or all of the values entering the (ReLU) activation are negative and thus clipped at zero, resulting in a 0 gradient for all inputs.
– Layers with ReLU activations can’t become saturated, but they can die.
– Caused by all signals entering the layer being negative
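A sketch of the gradient-norm clipping mentioned above; the threshold value is an arbitrary illustration:

    import numpy as np

    def clip_gradient(g, threshold=5.0):
        # rescale the gradient if its L2 norm exceeds the threshold
        norm = np.linalg.norm(g)
        if norm > threshold:
            g = g * (threshold / norm)
        return g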
Random Initialization
• When enough supervised training data is available, one can just treat the feature embeddings the same as
other model parameters: initialize the embedding vectors to random values, and let the network-training
procedure tune them
• The method used by Word2Vec is to initialize the word vectors to uniformly sampled random numbers in the range [−1/(2d), 1/(2d)], where d is the number of dimensions.
• In practice, random initialization is used to initialize the embedding vectors of commonly occurring features
and supervised or unsupervised pre-training is used to initialize the potentially rare features.
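A sketch of the Word2Vec-style random initialization of an embedding matrix; the vocabulary size and dimension are illustrative:

    import numpy as np

    vocab_size, d = 10000, 100
    rng = np.random.default_rng(0)
    # uniform initialization in [-1/(2d), 1/(2d)]
    E = rng.uniform(-1.0 / (2 * d), 1.0 / (2 * d), size=(vocab_size, d))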
Unsupervised Pre-Training
• We want "similar" words to have similar embedding vectors.
• Distributional Hypothesis: words are similar if they appear in similar contexts
• An important benefit of training word embeddings on large amounts of unannotated data is that it provides vector representations for words that do not appear in the supervised training set.
• When using pre-trained embeddings, there are two choices:
– Should the pre-trained word vectors be used as is, or should each vector be normalized to unit length?
– How do we fine-tune the pre-trained vectors for the task? Do we treat the embedding matrix E as a parameter and train it?
• Unlike language models, word-embedding algorithms are not constrained by the need to condition on contexts that can be combined using the chain rule of probability to produce sentence-level probability estimates.
• The use of randomly sampled words to produce negative examples of incorrect word-context pairs to drive the optimization is at the core of the Word2Vec algorithm.
• CBOW: For a multi-word context c_1:k, the CBOW variant of Word2Vec defines the context vector c to be the sum of the embedding vectors of the context components, c = Σ_{i=1}^{k} c_i. It then defines the score to be s(w, c) = w · c.
• Skip-Gram: For a k-element context c_1:k, the skip-gram variant assumes that the elements c_i in the context are independent of each other, essentially treating them as k different contexts, i.e., a word-context pair (w, c_1:k) will be represented in D as k different contexts: (w, c_1), ..., (w, c_k).
• GloVe Algorithm: Constructs an explicit word-context matrix, and trains the word and context vectors w and c attempting to satisfy:
w · c + b[w] + b[c] = log #(w, c)   ∀(w, c) ∈ D
where b[w] and b[c] are word-specific and context-specific trained biases.
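A sketch of the CBOW context vector and score described above, assuming separate word and context embedding matrices (the names W_emb and C_emb and the sizes are mine):

    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, d = 1000, 50
    W_emb = rng.normal(size=(vocab_size, d))    # word embeddings
    C_emb = rng.normal(size=(vocab_size, d))    # context embeddings

    def cbow_score(word_id, context_ids):
        c = C_emb[context_ids].sum(axis=0)      # c = sum of context embeddings
        return W_emb[word_id] @ c               # s(w, c) = w . c

    print(cbow_score(7, [3, 42, 99, 512]))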
Word Similarity
• To compute the similarity between two words using a similarity function over vectors, use cosine similarity, corresponding to the cosine of the angle between the vectors.
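For reference, the cosine similarity between two word vectors u and v is:

sim_cos(u, v) = (u · v) / (‖u‖ ‖v‖)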
Word Clustering
• Word vectors can be easily clustered.
• Clusters can then be used as features in learning algorithms that work with discrete features or require discrete
symbols.
• Similarity to all other words can be computed by the matrix-vector multiplication s = Ew. The result s is a vector of similarities, where s[i] is the similarity of w to the ith word in the vocabulary. This can be used to find the k most similar words to a word.
• To find the most similar word to a group of words, we define the similarity to a group of words to be the average similarity to the items in the group, which can be computed as:
s = E(w_1 + w_2 + · · · + w_k)/k
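A NumPy sketch of these similarity computations, assuming a row-normalized embedding matrix E and word vectors as rows; the function names and the choice of k are mine:

    import numpy as np

    def most_similar(E, w, k=5):
        s = E @ w                        # s[i]: similarity of w to the i-th word
        return np.argsort(-s)[:k]        # indices of the k most similar words

    def most_similar_to_group(E, group, k=5):
        # average similarity to the group: s = E(w1 + ... + wk)/k
        s = E @ (np.sum(group, axis=0) / len(group))
        return np.argsort(-s)[:k]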
Odd-One Out
• An odd-one-out question can be answered by computing the similarity between each word and the average word vector of the group, and returning the least similar word.
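A sketch of the odd-one-out computation, under the same conventions as the previous sketch (dot-product similarity over normalized row vectors):

    import numpy as np

    def odd_one_out(vectors):
        # similarity of each word to the average vector of the group;
        # the least similar word is the odd one out
        mean = np.mean(vectors, axis=0)
        return int(np.argmin(vectors @ mean))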
Word Analogies
• Analogy Solving Task: Different word embeddings are evaluated on their ability to answer analogy questions.
• 3CosAdd: Maximize similarity to analogy end-word, dissimilarity to analogy start-word, and similarity to
input start-word.
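As typically defined in the word-analogy literature (the notation here is mine, not from the notes), 3CosAdd answers the analogy "m is to w as k is to ?" by returning:

answer = argmax_{v ∈ V \ {m, w, k}} [ cos(v, k) − cos(v, m) + cos(v, w) ]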