DL Notes


1. List and explain applications that can be modeled using RNN.
2. Illustrate the Deep belief network with a neat sketch.
3. Illustrate the structure of a Hopfield net with a neat sketch.
4. List the different types of auto-encoders.
5. Write briefly about Boltzmann machine.
6. Discuss about sparse coding and computer vision.
7. Discuss the features of the TensorFlow, Caffe, Theano, and Torch tools.
8. List and explain NLP packages and tools with examples.

1. What is data mining and data warehousing? Explain the primary methodologies of data
mining.
2. Describe prescriptive analytics and its steps in the business analytics process.
3. Discuss qualitative and judgmental forecasting and statistical forecasting models.
4. Explain different forecasting models for stationary time series and regression forecasting
with causal variables.
5. Write about the Monte Carlo simulation model and the cash budget model.
6. Explain the decision strategies, decision trees, and decision theory model elements.
7. Write about the value of information, utility, and decision making.
8. Discuss collaborative business intelligence, data storytelling, and data journalism.
1. List and explain applications that can be modeled using RNN
Recurrent neural networks (RNNs) are the state-of-the-art algorithms for sequential data and are used
by Apple's Siri and Google's voice search. The RNN is the first algorithm that remembers its input,
thanks to an internal memory, which makes it perfectly suited for machine learning problems that
involve sequential data. It is one of the algorithms behind the scenes of the amazing achievements
seen in deep learning over the past few years. In this post, we'll cover the basic concepts of how
recurrent neural networks work, what the biggest issues are and how to solve them.

RNNs are a powerful and robust type of neural network, and they belong to the most promising
algorithms in use because they are the only ones with an internal memory.
Like many other deep learning algorithms, recurrent neural networks are relatively old. They were
initially created in the 1980s, but only in recent years have we seen their true potential. An
increase in computational power, the massive amounts of data that we now have to work with,
and the invention of long short-term memory (LSTM) in the 1990s have really brought RNNs to
the foreground.
Because of their internal memory, RNNs can remember important things about the input they
received, which allows them to be very precise in predicting what's coming next. This is why
they're the preferred algorithm for sequential data like time series, speech, text, financial data,
audio, video, weather and much more. Recurrent neural networks can form a much deeper
understanding of a sequence and its context compared to other algorithms.
WHAT IS A RECURRENT NEURAL NETWORK (RNN)?
Recurrent neural networks (RNN) are a class of neural networks that are helpful in modeling
sequence data. Derived from feedforward networks, RNNs exhibit similar behavior to how human
brains function. Simply put: recurrent neural networks produce predictive results in sequential data
that other algorithms can’t.
But when do you need to use an RNN?
“Whenever there is a sequence of data and that temporal dynamics that connects the data is more
important than the spatial content of each individual frame.” – Lex Fridman (MIT)
Since RNNs are being used in the software behind Siri and Google Translate, recurrent neural
networks show up a lot in everyday life.
How Recurrent Neural Networks Work
To understand RNNs properly, you'll need a working knowledge of "normal" feed-forward neural
networks and sequential data.
Sequential data is basically just ordered data in which related things follow each other. Examples
include financial data or a DNA sequence. The most popular type of sequential data is perhaps time
series data, which is just a series of data points listed in time order.
RNN VS. FEED-FORWARD NEURAL NETWORKS

RNNs and feed-forward neural networks get their names from the way they channel information.
In a feed-forward neural network, the information only moves in one direction — from the input
layer, through the hidden layers, to the output layer. The information moves straight through the
network and never touches a node twice.
Feed-forward neural networks have no memory of the input they receive and are bad at predicting
what’s coming next. Because a feed-forward network only considers the current input, it has no
notion of order in time. It simply can’t remember anything about what happened in the past except
its training.
In an RNN, the information cycles through a loop. When it makes a decision, it considers the current
input and also what it has learned from the inputs it received previously.
The two images below illustrate the difference in information flow between an RNN and a feed-
forward neural network.

A usual RNN has a short-term memory. In combination with an LSTM, it also has a long-term
memory (more on that later).
Another good way to illustrate the concept of a recurrent neural network's memory is to explain it
with an example:
Imagine you have a normal feed-forward neural network and give it the word "neuron" as an input
and it processes the word character by character. By the time it reaches the character "r," it has
already forgotten about "n," "e" and "u," which makes it almost impossible for this type of neural
network to predict which character would come next.
A recurrent neural network, however, is able to remember those characters because of its internal
memory. It produces output, copies that output and loops it back into the network.
Simply put: recurrent neural networks add the immediate past to the present.
Therefore, an RNN has two inputs: the present and the recent past. This is important because the
sequence of data contains crucial information about what is coming next, which is why an RNN can
do things other algorithms can't.
A feed-forward neural network assigns, like all other deep learning algorithms, a weight matrix to
its inputs and then produces the output. Note that RNNs apply weights to the current and also to
the previous input. Furthermore, a recurrent neural network will also tweak the weights for both
through gradient descent and backpropagation through time (BPTT).
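The following is a minimal NumPy sketch of a single recurrent step, showing how one weight matrix is applied to the current input and another to the previous hidden state (the internal memory). The names W_xh, W_hh, W_hy and all dimensions are illustrative, not taken from the text above.

import numpy as np

# Dimensions chosen only for illustration
input_size, hidden_size, output_size = 4, 8, 3

rng = np.random.default_rng(0)
W_xh = rng.standard_normal((hidden_size, input_size)) * 0.1   # weights for the current input
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.1  # weights for the previous hidden state
W_hy = rng.standard_normal((output_size, hidden_size)) * 0.1  # weights for the output
b_h = np.zeros(hidden_size)
b_y = np.zeros(output_size)

def rnn_step(x_t, h_prev):
    """One RNN time step: combine the present input with the recent past."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)  # internal memory update
    y_t = W_hy @ h_t + b_y                           # output at this time step
    return h_t, y_t

# Run a short sequence through the recurrence
h = np.zeros(hidden_size)
sequence = [rng.standard_normal(input_size) for _ in range(5)]
for x_t in sequence:
    h, y = rnn_step(x_t, h)
print(y.shape)  # (3,)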
TYPES OF RNNS
 One to One
 One to Many
 Many to One
 Many to Many
Also note that while feed-forward neural networks map one input to one output, RNNs can map
one to many, many to many (translation) and many to one (classifying a voice).

Backpropagation through Time


To understand the concept of backpropagation through time you'll need to understand the
concepts of forward and backpropagation first. We could spend an entire article discussing these
concepts, so I will attempt to provide as simple a definition as possible.
WHAT IS BACKPROPAGATION?
Backpropagation (BP or backprop, for short) is known as a workhorse algorithm in machine
learning. Backpropagation is used for calculating the gradient of an error function with respect to a
neural network's weights. The algorithm works its way backwards through the various layers of
gradients to find the partial derivative of the errors with respect to the weights. Backprop then uses
these gradients to update the weights and decrease the error during training.
In neural networks, you basically do forward-propagation to get the output of your model and
check if this output is correct or incorrect, to get the error. Backpropagation is nothing but going
backwards through your neural network to find the partial derivatives of the error with respect to
the weights, which enables you to subtract this value from the weights.
Those derivatives are then used by gradient descent, an algorithm that can iteratively minimize a
given function. Then it adjusts the weights up or down, depending on which decreases the error.
That is exactly how a neural network learns during the training process.
So, with backpropagation you basically try to tweak the weights of your model while training.
The image below illustrates the concept of forward propagation and backpropagation in a feed-
forward neural network:
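A minimal sketch of forward propagation, error computation, backpropagation, and a gradient-descent weight update, using PyTorch's automatic differentiation. The shapes, learning rate, and random data are illustrative only.

import torch

# Tiny one-layer network: y_hat = x @ w + b (illustrative shapes)
x = torch.randn(16, 4)            # batch of inputs
y = torch.randn(16, 1)            # targets
w = torch.randn(4, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lr = 0.1

for step in range(100):
    y_hat = x @ w + b                     # forward propagation
    error = ((y_hat - y) ** 2).mean()     # check how wrong the output is
    error.backward()                      # backpropagation: d(error)/d(weights)
    with torch.no_grad():
        w -= lr * w.grad                  # gradient descent: subtract the derivative
        b -= lr * b.grad
        w.grad.zero_()
        b.grad.zero_()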

BPTT is basically just a fancy buzzword for doing backpropagation on an unrolled RNN. Unrolling is a
visualization and conceptual tool, which helps you understand what's going on within the network.
Most of the time when implementing a recurrent neural network in the common programming
frameworks, backpropagation is automatically taken care of, but you need to understand how it
works to troubleshoot problems that may arise during the development process.
You can view a RNN as a sequence of neural networks that you train one after another with
backpropagation.
The image below illustrates an unrolled RNN. On the left, the RNN is unrolled after the equal sign.
Note there is no cycle after the equal sign since the different time steps are visualized and
information is passed from one time step to the next. This illustration also shows why a RNN can be
seen as a sequence of neural networks.

An unrolled version of RNN


If you do BPTT, the conceptualization of unrolling is required since the error of a given timestep
depends on the previous time step.
Within BPTT the error is backpropagated from the last to the first timestep, while unrolling all the
timesteps. This allows calculating the error for each timestep, which allows updating the weights.
Note that BPTT can be computationally expensive when you have a high number of timesteps.
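The sketch below shows BPTT in practice, assuming PyTorch handles the unrolling: the loss accumulates over every time step of the sequence, and a single backward() call propagates the error from the last time step back to the first. All dimensions and hyperparameters are illustrative.

import torch
import torch.nn as nn

# Illustrative dimensions only
seq_len, batch, input_size, hidden_size = 10, 2, 4, 8
rnn = nn.RNN(input_size, hidden_size)
readout = nn.Linear(hidden_size, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(list(rnn.parameters()) + list(readout.parameters()), lr=0.01)

x = torch.randn(seq_len, batch, input_size)   # a whole sequence
targets = torch.randn(seq_len, batch, 1)

outputs, h_n = rnn(x)                 # the framework unrolls the recurrence over all time steps
predictions = readout(outputs)
loss = loss_fn(predictions, targets)  # error accumulated over every time step

loss.backward()    # backpropagation through time: gradients flow from the last step to the first
optimizer.step()
optimizer.zero_grad()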
Two issues of standard RNNs
There are two major obstacles RNNs have had to deal with, but to understand them, you first need
to know what a gradient is.
A gradient is a partial derivative with respect to its inputs. If you don’t know what that means, just
think of it like this: a gradient measures how much the output of a function changes if you change
the inputs a little bit.
You can also think of a gradient as the slope of a function. The higher the gradient, the steeper the
slope and the faster a model can learn. But if the slope is zero, the model stops learning. A gradient
simply measures the change in all weights with regard to the change in error.
EXPLODING GRADIENTS
Exploding gradients occur when the algorithm, without much reason, assigns an excessively high
importance to the weights. Fortunately, this problem can be easily solved by truncating or
squashing (clipping) the gradients, as sketched below.
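A minimal sketch of squashing the gradients via gradient clipping, using PyTorch's built-in utility; the clipping is applied between backward() and the optimizer step. The stand-in model, data, and max_norm value are illustrative.

import torch
import torch.nn as nn

model = nn.Linear(4, 1)                      # stand-in model, illustrative only
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(8, 4), torch.randn(8, 1)

loss = nn.functional.mse_loss(model(x), y)
loss.backward()
# Clip (squash) the overall gradient norm so it never exceeds max_norm
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()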
VANISHING GRADIENTS
Vanishing gradients occur when the values of a gradient are too small and the model stops learning
or takes way too long as a result. This was a major problem in the 1990s and much harder to solve
than the exploding gradients. Fortunately, it was solved through the concept of LSTM by Sepp
Hochreiter and Juergen Schmidhuber.
2. Illustrate the Deep belief network with a neat sketch
What is Deep Belief Network?
 DBN is an unsupervised probabilistic deep learning algorithm.
 DBN is composed of multiple layers of stochastic latent variables. Latent variables are binary and
are also called feature detectors or hidden units.
 DBN is a generative hybrid graphical model. The top two layers are undirected; the lower layers
have directed connections from the layers above.
Architecture of DBN

Deep Belief Network


 It is a stack of Restricted Boltzmann Machines (RBMs) or autoencoders.
 The top two layers of a DBN have undirected, symmetric connections between them and form an
associative memory.
 The connections between all lower layers are directed, with the arrows pointing toward the
layer that is closest to the data. The lower layers have directed acyclic connections that convert
the associative memory into observed variables. The lowest layer, or the visible units, receives the
input data. Input data can be binary or real-valued.
 There are no intra-layer connections, as in an RBM.
 Hidden units represent features that capture the correlations present in the data.
 Two layers are connected by a matrix of symmetric weights W.
 Every unit in each layer is connected to every unit in each neighboring layer.
How does DBN work?
 DBNs are pre-trained using a greedy learning algorithm. The greedy learning algorithm uses a
layer-by-layer approach for learning the top-down, generative weights. These generative weights
determine how variables in one layer depend on the variables in the layer above.
 In a DBN we run several steps of Gibbs sampling on the top two hidden layers. This stage is
essentially drawing a sample from the RBM defined by the top two hidden layers.
 Then we use a single pass of ancestral sampling through the rest of the model to draw a sample
from the visible units.
 During learning, the values of the latent variables in every layer can be inferred by a single,
bottom-up pass. Greedy pretraining starts with an observed data vector in the bottom layer; it then
adjusts the generative weights in the reverse direction during fine-tuning. A minimal pretraining
sketch follows below.
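A minimal sketch of greedy layer-by-layer pretraining, assuming scikit-learn's BernoulliRBM as the building block: each RBM is trained on the hidden representation produced by the RBM below it. The toy data and layer sizes are illustrative, and this sketch omits the top-level associative memory and the fine-tuning pass.

import numpy as np
from sklearn.neural_network import BernoulliRBM

# Toy binary data, only for illustration
rng = np.random.default_rng(0)
X = (rng.random((500, 64)) > 0.5).astype(float)

layer_sizes = [32, 16]     # hidden units per RBM in the stack
rbms, layer_input = [], X

# Greedy layer-by-layer pretraining: each RBM is trained on the
# hidden representation produced by the RBM below it.
for n_hidden in layer_sizes:
    rbm = BernoulliRBM(n_components=n_hidden, learning_rate=0.05,
                       n_iter=10, random_state=0)
    rbm.fit(layer_input)
    layer_input = rbm.transform(layer_input)   # feed hidden activations upward
    rbms.append(rbm)

print(layer_input.shape)   # (500, 16): representation from the top RBM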
3. Illustrate the structure Hopfield net with a neat sketch
The Hopfield neural network, invented by Dr. John J. Hopfield, consists of one layer of 'n' fully
connected recurrent neurons. It is generally used for performing auto-association and optimization
tasks. It is computed using a converging iterative process, and it generates a different response
than our normal neural nets.
Discrete Hopfield Network: It is a fully interconnected neural network where each unit is
connected to every other unit. It behaves in a discrete manner, i.e. it gives finite distinct output,
generally of two types:
 Binary (0/1)
 Bipolar (-1/1)
The weights associated with this network are symmetric in nature and have the following
properties:
1. wij = wji  2. wii = 0
Structure & Architecture
 Each neuron has an inverting and a non-inverting output.
 Being fully connected, the output of each neuron is an input to all other neurons, but not to
itself.
Fig 1 shows a sample representation of a Discrete Hopfield Neural Network architecture having
the following elements.

Fig 1: Discrete Hopfield Network Architecture

[ x1 , x2 , ... , xn ] -> input to the n given neurons
[ y1 , y2 , ... , yn ] -> output obtained from the n given neurons
wij -> weight associated with the connection between the ith and the jth neuron
Training Algorithm
For storing a set of input patterns S(p) [p = 1 to P], where S(p) = S1(p) … Si(p) … Sn(p), the weight
matrix is given by:
 For binary patterns: wij = Σp [2Si(p) - 1][2Sj(p) - 1], for i ≠ j
 For bipolar patterns: wij = Σp Si(p) Sj(p), for i ≠ j
(i.e. the weights here have no self-connection, so wii = 0)


Steps Involved
Step 1 - Initialize the weights (wij) to store the patterns (using the training algorithm above).
Step 2 - For each input vector x, perform steps 3-7.
Step 3 - Make the initial activations of the network equal to the external input vector x: yi = xi (i = 1 to n).
Step 4 - For each unit yi, perform steps 5-7.
Step 5 - Calculate the total input to the network, yin, using the equation given below:
yin,i = xi + Σj yj wji
Step 6 - Apply the activation over the total input to calculate the output as per the equation given
below:
yi = 1 if yin,i > θi; yi unchanged if yin,i = θi; yi = 0 if yin,i < θi
(where θi is the threshold, normally taken as 0)
Step 7 - Now feed the obtained output yi back to all other units. Thus, the activation vectors are
updated.
Step 8 - Test the network for convergence. A minimal NumPy sketch of these steps follows below.
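A minimal NumPy sketch of a discrete Hopfield network for bipolar (+1/-1) patterns, assuming the Hebbian storage rule and asynchronous updates described above; the pattern, threshold, and number of sweeps are illustrative.

import numpy as np

def train_hopfield(patterns):
    """Store bipolar (+1/-1) patterns: w_ij = sum_p S_i(p) S_j(p), w_ii = 0."""
    W = patterns.T @ patterns
    np.fill_diagonal(W, 0)          # no self-connections
    return W

def recall(W, x, theta=0.0, steps=10):
    """Asynchronous updates: feed each output back until the state stops changing."""
    y = x.copy()
    for _ in range(steps):
        for i in np.random.permutation(len(y)):
            y_in = x[i] + W[i] @ y          # total input to unit i
            if y_in > theta:
                y[i] = 1
            elif y_in < theta:
                y[i] = -1                   # bipolar threshold rule; unchanged when equal
    return y

# Store one pattern and recall it from a noisy version
pattern = np.array([1, -1, 1, -1, 1, -1])
W = train_hopfield(pattern.reshape(1, -1))
noisy = pattern.copy()
noisy[0] = -1                               # corrupt one unit
print(recall(W, noisy))                     # converges back to the stored pattern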
Continuous Hopfield Network: Unlike the discrete Hopfield network, here the time parameter is
treated as a continuous variable. So, instead of getting binary/bipolar outputs, we can obtain
values that lie between 0 and 1. It can be used to solve constrained optimization and associative
memory problems. The output is defined as: vi = g(ui)
where vi = output of the continuous Hopfield network
ui = internal activity of a node in the continuous Hopfield network.
Energy Function
Hopfield networks have an energy function associated with them. It either diminishes or remains
unchanged on update (feedback) after every iteration. The energy function for a continuous
Hopfield network is defined as:

To determine if the network will converge to a stable configuration, we see if the energy function
reaches its minimum by:

The network is bound to converge if the activity of each neuron wrt time is given by the following
differential equation:
4. List the different types of auto-encoders.
An autoencoder encodes the input values x using a function f, and then decodes the encoded
values f(x) using a function g to create output values identical to the input values.
The autoencoder's objective is to minimize the reconstruction error between the input and output. This helps
autoencoders learn the important features present in the data. When a representation allows a good
reconstruction of its input, it has retained much of the information present in the input.
What are different types of Autoencoders?
Undercomplete Autoencoders

Undercomplete Autoencoder- Hidden layer has smaller dimension than input layer
 Goal of the Autoencoder is to capture the most important features present in the data.
 Undercomplete autoencoders have a smaller dimension for hidden layer compared to the
input layer. This helps to obtain important features from the data.
 Objective is to minimize the loss function by penalizing the g(f(x)) for being different from the
input x.

 When the decoder is linear and we use a mean squared error loss function, the undercomplete
autoencoder generates a reduced feature space similar to PCA.
 We get a powerful nonlinear generalization of PCA when the encoder function f and the decoder
function g are nonlinear.
 Undercomplete autoencoders do not need any regularization as they maximize the
probability of the data rather than copying the input to the output. A minimal sketch follows below.
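A minimal PyTorch sketch of an undercomplete autoencoder, assuming a hidden layer smaller than the input layer and a mean squared error reconstruction loss; the dimensions, random data, and training length are illustrative.

import torch
import torch.nn as nn

input_dim, hidden_dim = 784, 32      # hidden layer smaller than the input layer

encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())   # f(x)
decoder = nn.Linear(hidden_dim, input_dim)                             # g(h)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.rand(64, input_dim)        # stand-in batch of inputs

for epoch in range(10):
    h = encoder(x)                   # compressed code
    x_hat = decoder(h)               # reconstruction g(f(x))
    loss = loss_fn(x_hat, x)         # penalize g(f(x)) for differing from x
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()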
Sparse Autoencoders

Sparse Autoencoders use only reduced number of hidden nodes at a time


 Sparse autoencoders have more hidden nodes than input nodes. They can still discover
important features from the data.
 A sparsity constraint is introduced on the hidden layer. This is to prevent the output layer from
simply copying the input data.
 Sparse autoencoders have a sparsity penalty, Ω(h), a value close to zero but not zero. The
sparsity penalty is applied on the hidden layer in addition to the reconstruction error. This
prevents overfitting.
 Sparse autoencoders take the highest activation values in the hidden layer and zero out the
rest of the hidden nodes. This prevents the autoencoder from using all of the hidden nodes at a
time and forces only a reduced number of hidden nodes to be used.
 As we activate and deactivate hidden nodes for each row in the dataset, each hidden node
extracts a feature from the data. A minimal sketch of the sparsity penalty follows below.
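A minimal sketch of a sparse autoencoder loss, adding an L1 sparsity penalty Ω(h) on the hidden activations to the reconstruction error (the L1 form is one common choice; KL-divergence penalties are another). The oversized hidden layer and the weight lam are illustrative.

import torch
import torch.nn as nn

input_dim, hidden_dim, lam = 784, 1024, 1e-3   # more hidden nodes than inputs; lam is illustrative
encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
decoder = nn.Linear(hidden_dim, input_dim)
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.rand(64, input_dim)
h = encoder(x)
x_hat = decoder(h)
reconstruction = nn.functional.mse_loss(x_hat, x)
sparsity_penalty = h.abs().mean()              # Omega(h): L1 penalty pushes activations toward zero
loss = reconstruction + lam * sparsity_penalty
optimizer.zero_grad()
loss.backward()
optimizer.step()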
Denoising Autoencoders (DAE) :

Denoising Autoencoders — input is corrupted


 Denoising refers to intentionally adding noise to the raw input before providing it to the
network. Denoising can be achieved using stochastic mapping.
 Denoising autoencoders create a corrupted copy of the input by introducing some noise.
This helps to avoid the autoencoders to copy the input to the output without learning
features about the data.
 Corruption of the input can be done randomly by setting some of the inputs to zero; the
remaining nodes keep their values and copy them into the noised input.
 Denoising autoencoders must remove the corruption to generate an output that is similar to
the input. The output is compared with the clean input, not with the noised input. To minimize the
loss function, we continue training until convergence.
 Denoising autoencoders therefore minimize the loss function between the reconstructed output
and the original, uncorrupted input.

 Denoising helps the autoencoder learn the latent representation present in the data.
Denoising autoencoders ensure that a good representation is one that can be derived robustly
from a corrupted input and that will be useful for recovering the corresponding clean input.
 A denoising autoencoder is a stochastic autoencoder, since we use a stochastic corruption process
to set some of the inputs to zero. A minimal sketch follows below.
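A minimal denoising autoencoder sketch: the input is corrupted by randomly zeroing a fraction of its entries (a stochastic mapping), and the reconstruction is compared with the clean input, not the noised one. The corruption rate, dimensions, and data are illustrative.

import torch
import torch.nn as nn

input_dim, hidden_dim, corruption = 784, 128, 0.3   # zero out 30% of inputs (illustrative)
encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
decoder = nn.Linear(hidden_dim, input_dim)
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.rand(64, input_dim)                       # clean input
mask = (torch.rand_like(x) > corruption).float()    # stochastic corruption process
x_noisy = x * mask                                  # some inputs set to zero

x_hat = decoder(encoder(x_noisy))                   # reconstruct from the corrupted copy
loss = nn.functional.mse_loss(x_hat, x)             # compare with the clean input, not the noisy one
optimizer.zero_grad()
loss.backward()
optimizer.step()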
Contractive Autoencoders (CAE)

Contractive Autoencoders
 The contractive autoencoder (CAE) objective is to have a robust learned representation that is
less sensitive to small variations in the data.
 Robustness of the representation is achieved by applying a penalty term to the loss function.
The penalty term is the Frobenius norm of the Jacobian matrix of the hidden layer, calculated with
respect to the input. The Frobenius norm of the Jacobian matrix is the sum of the squares of all its
elements.

 Loss function with penalty term — Frobenius norm of the Jacobian matrix
 The contractive autoencoder is another regularization technique, like sparse autoencoders and
denoising autoencoders.
 A CAE surpasses results obtained by regularizing an autoencoder using weight decay or by
denoising. A CAE is a better choice than a denoising autoencoder for learning useful feature
extraction.
 The penalty term generates a mapping that strongly contracts the data, hence the name
contractive autoencoder.
Stacked Denoising Autoencoders

Stacked Denoising Autoencoder


 A stacked autoencoder is a neural network with multiple layers of sparse autoencoders.
 When we add more hidden layers than just one to an autoencoder, it helps to reduce
high-dimensional data to a smaller code representing important features.
 Each hidden layer is a more compact representation than the previous hidden layer.
 We can also denoise the input and then pass the data through the stacked autoencoders; this is
called a stacked denoising autoencoder.
 In stacked denoising autoencoders, input corruption is used only for the initial denoising. This
helps learn the important features present in the data. Once the mapping function f(θ) has been
learned, further layers use the uncorrupted output from the previous layers.
 After training a stack of encoders as explained above, we can use the output of the stacked
denoising autoencoder as input to a standalone supervised machine learning algorithm such as
support vector machines or multiclass logistic regression.
Deep Autoencoders

Deep Autoencoders (Source: G. E. Hinton* and R. R. Salakhutdinov, Science, 2006)


 Deep autoencoders consist of two identical deep belief networks: one network for encoding
and another for decoding.
 Typically, deep autoencoders have 4 to 5 layers for encoding and the next 4 to 5 layers for
decoding. We use unsupervised layer-by-layer pre-training.
 The Restricted Boltzmann Machine (RBM) is the basic building block of the deep belief network.
RBMs are discussed further under the Boltzmann machine question below.
 In the figure above, we take an image with 784 pixels, train it using a stack of 4 RBMs, unroll
them, and then fine-tune with backpropagation.
 The final encoding layer is compact and fast.
5. Write briefly about Boltzmann machine
Deep Learning models are broadly classified into supervised and unsupervised models.
Supervised DL models:
 Artificial Neural Networks (ANNs)
 Recurrent Neural Networks (RNNs)
 Convolutional Neural Networks (CNNs)
Unsupervised DL models:
 Self Organizing Maps (SOMs)
 Boltzmann Machines
 Autoencoders
Let us learn what exactly Boltzmann machines are, how they work and also implement a
recommender system which recommends whether the user likes a movie or not based on the
previous movies watched.
A Boltzmann Machine is an unsupervised DL model in which every node is connected to every
other node. That is, unlike ANNs, CNNs, RNNs and SOMs, Boltzmann Machines
are undirected (or the connections are bidirectional). A Boltzmann Machine is not a deterministic
DL model but a stochastic or generative DL model. It is rather a representation of a certain
system. There are two types of nodes in the Boltzmann Machine — Visible nodes — those nodes
which we can and do measure, and the Hidden nodes – those nodes which we cannot or do not
measure. Although the node types are different, the Boltzmann machine considers them as the
same and everything works as one single system. The training data is fed into the Boltzmann
Machine and the weights of the system are adjusted accordingly. Boltzmann machines help us
understand abnormalities by learning about the working of the system in normal conditions.

Boltzmann Machine
Energy-Based Models:
Boltzmann Distribution is used in the sampling distribution of the Boltzmann Machine. The
Boltzmann distribution is governed by the equation –
Pi = e(-∈i/kT)/ ∑e(-∈j/kT)
Pi - probability of system being in state i
∈i - Energy of system in state i
T - Temperature of the system
k - Boltzmann constant
∑e(-∈j/kT) - Sum of values for all possible states of the system
Boltzmann Distribution describes different states of the system and thus Boltzmann machines
create different states of the machine using this distribution. From the above equation, as the
energy of system increases, the probability for the system to be in state ‘i’ decreases. Thus, the
system is the most stable in its lowest energy state (a gas is most stable when it spreads). Here, in
Boltzmann machines, the energy of the system is defined in terms of the weights of synapses.
Once the system is trained and the weights are set, the system always tries to find the lowest
energy state for itself by adjusting the weights.
Types of Boltzmann Machines:
 Restricted Boltzmann Machines (RBMs)
 Deep Belief Networks (DBNs)
 Deep Boltzmann Machines (DBMs)
Restricted Boltzmann Machines (RBMs):
In a full Boltzmann machine, each node is connected to every other node, and hence the number of
connections grows quadratically with the number of nodes. This is the reason we use RBMs. The
restrictions on the node connections in RBMs are as follows –
 Hidden nodes cannot be connected to one another.
 Visible nodes are not connected to one another.
Energy function example for Restricted Boltzmann Machine –
E(v, h) = -∑ aivi - ∑ bjhj - ∑∑ viwi,jhj
a, b - biases in the system (constants)
vi, hj - visible node, hidden node
P(v, h) = probability of being in a certain state
P(v, h) = e(-E(v, h))/Z
Z - sum of values for all possible states (the partition function)
Suppose that we are using our RBM for building a recommender system that works on six (6)
movies. RBM learns how to allocate the hidden nodes to certain features. By the process
of Contrastive Divergence, we make the RBM close to our set of movies that is our case or
scenario. RBM identifies which features are important by the training process. The training data is
either 0 or 1 or missing data based on whether a user liked that movie (1), disliked that movie (0)
or did not watch the movie (missing data). RBM automatically identifies important features.
Contrastive Divergence:
The RBM adjusts its weights by this method. Using some randomly assigned initial weights, the RBM
calculates the hidden nodes, which in turn use the same weights to reconstruct the input nodes.
Each hidden node is constructed from all the visible nodes and each visible node is reconstructed
from all the hidden nodes; hence, the reconstructed input is different from the original input, even
though the weights are the same. The process continues until the reconstructed input matches the
previous input. The process is said to have converged at this stage. This entire procedure is known
as Gibbs Sampling. A minimal CD-1 sketch follows below.
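A minimal NumPy sketch of one contrastive divergence (CD-1) update: sample the hidden units from the data, reconstruct the visible units, resample the hidden units (one Gibbs step), and move the weights toward the data-driven correlations and away from the reconstruction-driven ones. The sizes, learning rate, and toy movie vector are illustrative.

import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 6, 3, 0.1         # e.g. 6 movies, 3 hidden features (illustrative)
W = rng.standard_normal((n_visible, n_hidden)) * 0.1
a = np.zeros(n_visible)                     # visible biases
b = np.zeros(n_hidden)                      # hidden biases

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0):
    """One step of contrastive divergence (CD-1) for a single training vector."""
    global W, a, b
    # Positive phase: hidden probabilities given the data
    h0_prob = sigmoid(v0 @ W + b)
    h0 = (rng.random(n_hidden) < h0_prob).astype(float)
    # Negative phase: reconstruct the visible units, then the hidden units again (one Gibbs step)
    v1_prob = sigmoid(h0 @ W.T + a)
    v1 = (rng.random(n_visible) < v1_prob).astype(float)
    h1_prob = sigmoid(v1 @ W + b)
    # Move weights toward the data correlations and away from the reconstruction correlations
    W += lr * (np.outer(v0, h0_prob) - np.outer(v1, h1_prob))
    a += lr * (v0 - v1)
    b += lr * (h0_prob - h1_prob)

v0 = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 1.0])   # a user's liked/disliked movies
for _ in range(100):
    cd1_update(v0)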

Gibbs Sampling
The gradient formula gives the gradient of the log probability of a certain state of the system
with respect to the weights of the system. It is given as follows –
d/dwij(log(P(v0))) = <vi0 * hj0> - <vi∞ * hj∞>
v - visible state, h - hidden state
<vi0 * hj0> - initial state of the system
<vi∞ * hj∞> - final state of the system
P(v0) - probability that the system is in state v0
wij - weights of the system
The above equation tells us how a change in the weights of the system will change the log
probability of the system being in a particular state. The system tries to end up in the lowest
possible energy state (most stable). Instead of continuing the weight-adjustment process until
the current input matches the previous one, we can also consider only the first few passes. It is
sufficient to understand how to adjust our curve so as to get the lowest energy state. Therefore,
we adjust the weights and redesign the system and energy curve such that we get the lowest energy
for the current position. This is known as Hinton's shortcut.
Hinton’s Shortcut
Working of RBM – Illustrative Example –
Consider – Mary watches four movies out of the six available movies and rates four of them. Say,
she watched m1, m3, m4 and m5 and likes m3, m5 (rated 1) and dislikes the other two, that is m1,
m4 (rated 0), whereas the other two movies – m2, m6 – are unrated. Now, using our RBM, we will
recommend one of these movies for her to watch next. Say –
 m3, m5 are of ‘Drama’ genre.
 m1, m4 are of ‘Action’ genre.
 ‘Dicaprio’ played a role in m5.
 m3, m5 have won ‘Oscar.’
 ‘Tarantino’ directed m4.
 m2 is of the ‘Action’ genre.
 m6 is of both the genres ‘Action’ and ‘Drama’, ‘Dicaprio’ acted in it and it has won an
‘Oscar’.
We have the following observations –
 Mary likes m3, m5 and they are of genre ‘Drama,’ she probably likes ‘Drama’ movies.
 Mary dislikes m1, m4 and they are of action genre, she probably dislikes ‘Action’ movies.
 Mary likes m3, m5 and they have won an ‘Oscar’, she probably likes an ‘Oscar’ movie.
 Since ‘Dicaprio’ acted in m5 and Mary likes it, she will probably like a movie in
which ‘Dicaprio’ acted.
 Mary does not like m4 which is directed by Tarantino, she probably dislikes any movie
directed by ‘Tarantino’.
Therefore, based on the observations and the details of m2, m6, our RBM recommends m6 to
Mary (‘Drama’, ‘Dicaprio’ and ‘Oscar’ match both Mary’s interests and m6). This is how an RBM
works and hence is used in recommender systems.
Working of RBM
Thus, RBMs are used to build Recommender Systems.
Deep Belief Networks (DBNs):
Suppose we stack several RBMs on top of each other so that the first RBM outputs are the input
to the second RBM and so on. Such networks are known as Deep Belief Networks. The
connections within each layer are undirected (since each layer is an RBM). Simultaneously, those
in between the layers are directed (except the top two layers – the connection between the top
two layers is undirected). There are two ways to train the DBNs-
1. Greedy Layer-wise Training Algorithm – The RBMs are trained layer by layer. Once the
individual RBMs are trained (that is, the parameters – weights, biases are set), the
direction is set up between the DBN layers.
2. Wake-Sleep Algorithm – The DBN is trained all the way up (connections going up – wake)
and then down the network (connections going down — sleep).
Therefore, we stack the RBMs, train them, and once we have the parameters trained, we make
sure that the connections between the layers only work downwards (except for the top two
layers).
Deep Boltzmann Machines (DBMs):
DBMs are similar to DBNs except that apart from the connections within layers, the connections
between the layers are also undirected (unlike DBN in which the connections between layers are
directed). DBMs can extract more complex or sophisticated features and hence can be used for
more complex tasks.
6. Discuss about sparse coding and computer vision.
Sparse Coding: Sparse coding is a class of unsupervised methods for learning sets of over-complete
bases to represent data efficiently. Sparse coding aims to find a set of basis vectors ϕi such that we
can represent an input vector x as a linear combination of these basis vectors:
x = Σi ai ϕi
While techniques such as Principal Component Analysis (PCA) allow us to learn a complete set of
basis vectors efficiently, we wish to learn an over-complete set of basis vectors to represent input
vectors x∈Rn (i.e. such that k>n). The advantage of having an over-complete basis is that our basis
vectors are better able to capture structures and patterns inherent in the input data. However, with
an over-complete basis, the coefficients ai are no longer uniquely determined by the input vector x.
Therefore, in sparse coding, we introduce the additional criterion of sparsity to resolve the
degeneracy introduced by over-completeness. Here, we define sparsity as having few non-zero
components or having few components not close to zero. The requirement that our coefficients ai be
sparse means that given an input vector, we would like as few of our coefficients to be far from zero
as possible. The choice of sparsity as a desired characteristic of our representation of the input data
can be motivated by the observation that most sensory data such as natural images may be
described as the superposition of a small number of atomic elements such as surfaces or edges.
Other justifications such as comparisons to the properties of the primary visual cortex have also
been advanced.
We define the sparse coding cost function on a set of m input vectors x(1), …, x(m) as
minimize over a and ϕ: Σj || x(j) − Σi aj,i ϕi ||² + λ Σj Σi S(aj,i)
where S(.) is a sparsity cost function that penalizes ai for being far from zero. We can interpret the
first term of the sparse coding objective as a reconstruction term that tries to force the algorithm to
provide a good representation of x and the second term as a sparsity penalty which forces our
representation of x to be sparse. The constant λ is a scaling constant to determine the relative
importance of these two contributions.
Although the most direct measure of sparsity is the "L0" norm (S(ai) = 1(|ai| > 0)), it is non-
differentiable and difficult to optimize in general. In practice, common choices for the sparsity cost
S(.) are the L1 penalty S(ai) = |ai| and the log penalty S(ai) = log(1 + ai²).
Also, it is possible to make the sparsity penalty arbitrarily small by scaling down ai and scaling ϕi up
by some large constant. To prevent this from happening, we will constrain ||ϕi||² to be less than
some constant C. The full sparse coding cost function is therefore the objective above subject to
||ϕi||² ≤ C for all i. A small sketch of inferring the sparse coefficients follows below.
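A minimal NumPy sketch of inferring the sparse coefficients a for one input x, with the over-complete basis ϕ held fixed, using iterative shrinkage-thresholding (ISTA): a gradient step on the reconstruction term followed by a soft-threshold that drives small coefficients to exactly zero. The sizes, λ, and iteration count are illustrative.

import numpy as np

rng = np.random.default_rng(0)
n, k, lam = 16, 32, 0.1                 # k > n: over-complete basis (illustrative sizes)
Phi = rng.standard_normal((n, k))
Phi /= np.linalg.norm(Phi, axis=0)      # constrain the norm of each basis vector
x = rng.standard_normal(n)              # one input vector

def soft_threshold(z, t):
    """Proximal operator of the L1 penalty: shrinks values toward zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

a = np.zeros(k)
step = 1.0 / np.linalg.norm(Phi.T @ Phi, 2)    # safe step size (1 / largest eigenvalue)
for _ in range(200):
    residual = Phi @ a - x                     # gradient step on the reconstruction term
    a = soft_threshold(a - step * (Phi.T @ residual), step * lam)

print(np.count_nonzero(a), "of", k, "coefficients are non-zero")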
Computer vision has traditionally been one of the most active research areas for deep learning
applications, because vision is a task that is effortless for humans and many animals but challenging
for computers (Ballard et al., 1983). Many of the most popular standard benchmark tasks for deep
learning algorithms are forms of object recognition or optical character recognition. Computer
vision is a very broad field encompassing a wide variety of ways of processing images, and an
amazing diversity of applications. Applications of computer vision range from reproducing human
visual abilities, such as recognizing faces, to creating entirely new categories of visual abilities.
computer vision task involving repairing defects in images or removing objects from images.
Preprocessing
Many application areas require sophisticated preprocessing because the original input comes in a
form that is difficult for many deep learning architectures to represent. Computer vision usually
requires relatively little of this kind of preprocessing. The images should be standardized so that
their pixels all lie in the same, reasonable range, like [0, 1] or [-1, 1]. Mixing images that lie in [0, 1]
with images that lie in [0, 255] will usually result in failure. Formatting images to have the same
scale is the only kind of preprocessing that is strictly necessary. Many computer vision architectures
require images of a standard size, so images must be cropped or scaled to fit that size. However,
even this rescaling is not always strictly necessary. Some convolutional models accept variably sized
inputs and dynamically adjust the size of their pooling regions to keep the output size constant.
Contrast Normalization
One of the most obvious sources of variation that can be safely removed for many tasks is the
amount of contrast in the image. Contrast simply refers to the magnitude of the difference
between the bright and the dark pixels in an image. There are many ways of quantifying the
contrast of an image. In the context of deep learning, contrast usually refers to the standard
deviation of the pixels in an image or region of an image.

Global contrast normalization will often fail to highlight image features we would like to stand out,
such as edges and corners. If we have a scene with a large dark area and a large bright area (such as
a city square with half the image in the shadow of a building) then global contrast normalization will
ensure there is a large difference between the brightness of the dark area and the brightness of the
light area. It will not, however, ensure that edges within the dark region stand out. This motivates
local contrast normalization. Local contrast normalization ensures that the contrast is normalized
across each small window, rather than over the image as a whole.
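A minimal NumPy sketch of both ideas: global contrast normalization rescales the whole image to zero mean and roughly unit standard deviation, while local contrast normalization applies the same rule within each small window. The window size and epsilon are illustrative, and the local version below uses a simple (slow) loop for clarity.

import numpy as np

def global_contrast_normalize(image, eps=1e-8):
    """Subtract the mean and divide by the standard deviation of the whole image."""
    image = image - image.mean()
    return image / max(image.std(), eps)

def local_contrast_normalize(image, window=9, eps=1e-8):
    """Normalize contrast within each small window rather than over the whole image."""
    out = np.zeros_like(image, dtype=float)
    pad = window // 2
    padded = np.pad(image.astype(float), pad, mode="reflect")
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            patch = padded[i:i + window, j:j + window]
            out[i, j] = (image[i, j] - patch.mean()) / max(patch.std(), eps)
    return out

img = np.random.rand(32, 32) * 255.0
print(global_contrast_normalize(img).std())   # roughly 1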

Dataset Augmentation
It is easy to improve the generalization of a classifier by increasing the size of the training set by
adding extra copies of the training examples that have been modified with transformations that do
not change the class. Object recognition is a classification task that is especially amenable to this
form of dataset augmentation because the class is invariant to so many transformations and the
input can be easily transformed with many geometric operations. As described before, classifiers
can benefit from random translations, rotations, and in some cases, flips of the input to augment
the dataset. In specialized computer vision applications, more advanced transformations are
commonly used for dataset augmentation. A small augmentation sketch follows below.
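A minimal sketch of class-preserving augmentation using torchvision.transforms: random translations, rotations, and horizontal flips applied to the same example. The ranges, probabilities, and stand-in image are illustrative.

import numpy as np
from PIL import Image
from torchvision import transforms

# Augmentation pipeline: random translations, rotations, and flips (ranges are illustrative)
augment = transforms.Compose([
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1)),  # small rotation + translation
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])

# A stand-in image; in practice this would be a training example
img = Image.fromarray((np.random.rand(64, 64, 3) * 255).astype(np.uint8))

# Each call produces a differently transformed copy of the same example
augmented_batch = [augment(img) for _ in range(4)]
print(augmented_batch[0].shape)    # torch.Size([3, 64, 64])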
7. Discuss the features of the TensorFlow, Caffe, Theano, and Torch tools.
Theano: Theano is a Python library that lets you define, optimize, and evaluate mathematical
expressions, especially ones with multi-dimensional arrays (NumPy.ndarray). Using Theano it is
possible to attain speeds rivaling handcrafted C implementations for problems involving large
amounts of data. It can also surpass C on a CPU by many orders of magnitude by taking advantage of
recent GPUs.
Theano combines aspects of a computer algebra system (CAS) with aspects of an optimizing
compiler. It can also generate customized C code for many mathematical operations. This
combination of CAS with optimizing compilation is particularly useful for tasks in which complicated
mathematical expressions are evaluated repeatedly and evaluation speed is critical. For situations
where many different expressions are each evaluated once Theano can minimize the amount of
compilation/analysis overhead, but still provide symbolic features such as automatic differentiation.
Theano’s compiler applies many optimizations of varying complexity to these symbolic expressions.
These optimizations include, but are not limited to:
- Use of GPU for computations
- Constant folding
- Merging of similar subgraphs, to avoid redundant calculations
- Arithmetic simplification (e.g. x*y/x -> y, --x -> x)
- Inserting efficient BLAS operations (e.g. GEMM) in a variety of contexts
- Using memory aliasing to avoid calculation
- Using in-place operations wherever it does not interfere with aliasing
- Loop fusion for elementwise sub-expressions
- Improvements to numerical stability (e.g. log(1+exp(x)) and log(sum_i exp(x[i])))
- and many other optimizations
Theano was written at the LISA lab to support the rapid development of efficient machine learning
algorithms. Theano is named after the Greek mathematician, who may have been Pythagoras’ wife.
Theano is released under a BSD license.
Here is an example of how to use Theano. It doesn’t show off many of Theano’s features, but it
illustrates concretely what Theano is.
import theano
from theano import tensor
# declare two symbolic floating-point scalars
a = tensor.dscalar()
b = tensor.dscalar()
# create a simple expression
c = a + b
# convert the expression into a callable object that takes (a, b)
# values as input and computes a value for c
f = theano.function([a, b], c)
# bind 1.5 to 'a', 2.5 to 'b', and evaluate 'c'
assert 4.0 == f(1.5, 2.5)
Theano is not a programming language in the normal sense because you write a program in Python
that builds expressions for Theano. Still, it is like a programming language in the sense that you have
to
- Declare variables (a,b) and give their types
- Build expressions for how to put those variables together
- Compile expression graphs to functions to use them for computation
It is good to think of theano.function as the interface to a compiler which builds a callable object
from a purely symbolic graph. One of Theano’s most important features is that theano.function can
optimize a graph and even compile some or all of it into native machine instructions.
- execution speed optimizations: Theano can use g++ or nvcc to compile parts of your expression
graph into CPU or GPU instructions, which run much faster than pure Python.
- symbolic differentiation: Theano can automatically build symbolic graphs for computing gradients.
- stability optimizations: Theano can recognize [some] numerically unstable expressions and
compute them with more stable algorithms.
The closest Python package to Theano is SymPy. Theano focuses more on tensor expressions than
SymPy and has more machinery for compilation. SymPy has more sophisticated algebra rules and
can handle a wider variety of mathematical operations (such as series, limits, and integrals).
Procedure:
- Import the necessary libraries
- First, load the required dataset
- Do the necessary preprocessing; as our dataset is in CSV form, convert the dataset into matrix
form
- Now divide the data into X and Y parts
- First, pass the images through a sparse coder and convert them into a sparse representation
- Now build a 2-layer neural network with 10 neurons in the output layer, as we are doing digit
recognition
- Now pass the sparse-encoded images through the network and note down the cost at each
iteration
- Now draw a graph between the number of iterations and the cost function
- If the cost function decreases at each iteration, then our model is doing well
· Since the cost decreases at each iteration, our model is correct.
· We need more iterations for the cost to approach zero.

TensorFlow is an open-source end-to-end platform for creating Machine Learning applications. It is
a symbolic math library that uses dataflow and differentiable programming to perform various tasks
focused on training and inference of deep neural networks. It allows developers to create machine
learning applications using various tools, libraries, and community resources.
Currently, the most famous deep learning library in the world is Google's TensorFlow. Google uses
machine learning in all of its products to improve the search engine, translation,
image captioning and recommendations.
TensorFlow Example
To give a concrete example, Google users can experience a faster and more refined search
experience with AI. If the user types a keyword in the search bar, Google provides a
recommendation about what could be the next word.

TensorFlow Example
Google wants to use machine learning to take advantage of their massive datasets to give users the
best experience. Three different groups use machine learning:
 Researchers
 Data Scientists
 Programmers
They can all use the same toolset to collaborate with each other and improve their efficiency.
Google does not just have any data; they have the world's most massive computer, so TensorFlow
was built to scale. TensorFlow is a library developed by the Google Brain Team to accelerate
machine learning and deep neural network research.
It was built to run on multiple CPUs or GPUs and even mobile operating systems, and it has several
wrappers in several languages like Python, C++ or Java.

History of TensorFlow
A couple of years ago, deep learning started to outperform all other machine learning algorithms
when given a massive amount of data. Google saw it could use these deep neural networks to
improve its services:
 Gmail
 Photos
 Google search engine
They built a framework called TensorFlow to let researchers and developers work together on an AI
model. Once developed and scaled, it allows lots of people to use it.
It was first made public in late 2015, while the first stable version appeared in 2017. It is open
source under the Apache Open Source license. You can use it, modify it and redistribute the modified
version without paying anything to Google.
Next in this TensorFlow deep learning tutorial, we will learn about the TensorFlow architecture and
how TensorFlow works.
How TensorFlow Works
TensorFlow enables you to build dataflow graphs and structures to define how data moves through
a graph by taking inputs as a multi-dimensional array called a tensor. It allows you to construct a
flowchart of operations that can be performed on these inputs; the input goes in at one end and
comes out at the other end as output.
TensorFlow Architecture
Tensorflow architecture works in three parts:
 Preprocessing the data
 Build the model
 Train and estimate the model
It is called TensorFlow because it takes input as a multi-dimensional array, also known as a tensor.
You can construct a sort of flowchart of operations (called a Graph) that you want to perform on
that input. The input goes in at one end, and then it flows through this system of multiple
operations and comes out the other end as output.
This is why it is called TensorFlow: the tensor goes in, flows through a list of operations, and then
comes out the other side.
Where can Tensorflow run?
TensorFlow hardware and software requirements can be classified into two phases:
Development Phase: This is when you train the model. Training is usually done on your desktop or
laptop.
Run Phase or Inference Phase: Once training is done, TensorFlow can be run on many different
platforms. You can run it on
 Desktop running Windows, macOS or Linux
 Cloud as a web service
 Mobile devices like iOS and Android
You can train it on multiple machines and then run it on a different machine, once you have the
trained model.
TensorFlow Components
Tensor
TensorFlow's name is directly derived from its core framework: the tensor. In TensorFlow, all the
computations involve tensors. A tensor is a vector or matrix of n dimensions that represents all
types of data. All values in a tensor hold an identical data type with a known (or partially
known) shape. The shape of the data is the dimensionality of the matrix or array.
A tensor can originate from the input data or the result of a computation. In TensorFlow, all the
operations are conducted inside a graph. The graph is a set of computations that take place
successively. Each operation is called an op node, and the nodes are connected to each other.
The graph outlines the ops and the connections between the nodes. However, it does not display the
values. The edges of the nodes are the tensors, i.e., a way to populate the operations with data.
Graphs
TensorFlow makes use of a graph framework. The graph gathers and describes all the series of
computations done during training. The graph has lots of advantages:
 It was designed to run on multiple CPUs or GPUs and even mobile operating systems
 The portability of the graph allows you to preserve the computations for immediate or later use.
The graph can be saved to be executed in the future.
 All the computations in the graph are done by connecting tensors together
o A tensor has a node and an edge. The node carries the mathematical operation and
produces endpoint outputs. The edges explain the input/output
relationships between nodes. A small example follows below.
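A minimal sketch of tensors flowing through a graph of operations. In TensorFlow 2, wrapping a Python function with tf.function traces it into a graph; the particular tensors and ops here are illustrative.

import tensorflow as tf

# Tensors: multi-dimensional arrays with a known shape and data type
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[1.0], [1.0]])

@tf.function                      # traces the Python function into a TensorFlow graph
def model(x, y):
    z = tf.matmul(x, y)           # op node: matrix multiplication
    return tf.nn.relu(z)          # op node: nonlinearity; the edges between ops carry tensors

print(model(a, b).numpy())        # the tensor flows in, through the ops, and out as output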
Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is
developed by Berkeley AI Research (BAIR) and by community contributors. Yangqing Jia created the
project during his PhD at UC Berkeley. Caffe is released under the BSD 2-Clause license.
Why Caffe?
Expressive architecture encourages application and innovation. Models and optimization are
defined by configuration without hard-coding. Switch between CPU and GPU by setting a single flag
to train on a GPU machine then deploy to commodity clusters or mobile devices.
Extensible code fosters active development. In Caffe’s first year, it has been forked by over 1,000
developers and had many significant changes contributed back. Thanks to these contributors the
framework tracks the state-of-the-art in both code and models.
Speed makes Caffe perfect for research experiments and industry deployment. Caffe can
process over 60M images per day with a single NVIDIA K40 GPU*. That's 1 ms/image for inference
and 4 ms/image for learning, and more recent library versions and hardware are faster still. We
believe that Caffe is among the fastest convnet implementations available.
Community: Caffe already powers academic research projects, startup prototypes, and even large-
scale industrial applications in vision, speech, and multimedia. Join our community of brewers on
the caffe-users group and Github.
What Are Torch and PyTorch?
PyTorch is an open-source Python library for deep learning developed and maintained by Facebook.

The project started in 2016 and quickly became a popular framework among developers and
researchers.

Torch (Torch7) is an open-source project for deep learning written in C and generally used via the
Lua interface. It was a precursor project to PyTorch and is no longer actively developed. PyTorch
includes “Torch” in the name, acknowledging the prior torch library with the “Py” prefix indicating
the Python focus of the new project.

The PyTorch API is simple and flexible, making it a favorite for academics and researchers in the
development of new deep learning models and applications. The extensive use has led to many
extensions for specific applications (such as text, computer vision, and audio data), and many pre-
trained models that can be used directly. As such, it may be the most popular library used by
academics.

The flexibility of PyTorch comes at the cost of ease of use, especially for beginners, as compared to
simpler interfaces like Keras. Choosing PyTorch instead of Keras means giving up some ease of use
and accepting a slightly steeper learning curve and more code, in exchange for more flexibility and
perhaps a more vibrant academic community.

How to Install PyTorch

Before installing PyTorch, ensure that you have Python installed, such as Python 3.6 or higher.

If you don’t have Python installed, you can install it using Anaconda. This tutorial will show you
how:

How to Setup Your Python Environment for Machine Learning With Anaconda

There are many ways to install the PyTorch open-source deep learning library.

The most common, and perhaps simplest, way to install PyTorch on your workstation is by using
pip.

For example, on the command line, you can type:

sudo pip install torch

Perhaps the most popular application of deep learning is for computer vision, and the PyTorch
computer vision package is called “torchvision.”

Installing torchvision is also highly recommended and it can be installed as follows:

sudo pip install torchvision

If you prefer to use an installation method more specific to your platform or package manager, you
can see a complete list of installation instructions here:

There is no need to set up the GPU now.

All examples in this tutorial will work just fine on a modern CPU. If you want to configure PyTorch
for your GPU, you can do that after completing this tutorial. Don’t get distracted!

How to Confirm PyTorch Is Installed

Once PyTorch is installed, it is important to confirm that the library was installed successfully and
that you can start using it.

Don’t skip this step.

If PyTorch is not installed correctly or raises an error on this step, you won’t be able to run the
examples later.

Create a new file called versions.py and copy and paste the following code into the file.

# check pytorch version
import torch
print(torch.__version__)

Save the file, then open your command line and change directory to where you saved the file.
Then type:

python versions.py

You should then see output like the following:

1.3.1

This confirms that PyTorch is installed correctly and that we are all using the same version.

This also shows you how to run a Python script from the command line. I recommend running all
code from the command line in this manner, and not from a notebook or an IDE.
8. List and explain NLP packages and tools with examples.
Natural language processing (NLP) is the use of human languages, such as English or French, by a
computer. Computer programs typically read and emit specialized languages designed to allow
efficient and unambiguous parsing by simple programs. More naturally occurring languages are
often ambiguous and defy formal description. Natural language processing includes applications
such as machine translation, in which the learner must read a sentence in one human language and
emit an equivalent sentence in another human language. Many NLP applications are based on
language models that define a probability distribution over sequences of words, characters or bytes
in a natural language.
5 Best NLP tools and libraries
1. NLTK - entry-level open-source NLP Tool
The Natural Language Toolkit (NLTK) is an open-source NLP library for Python. The NLTK library is a
standard NLP tool developed for research and education.
NLTK provides users with a basic set of tools for text-related operations. It is a good starting point
for beginners in Natural Language Processing. A short usage example follows the feature list below.
Natural Language Toolkit features include:
 Text classification
 Part-of-speech tagging
 Entity extraction
 Tokenization
 Parsing
 Stemming
 Semantic reasoning
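A minimal NLTK example of tokenization, part-of-speech tagging, and stemming. The download calls fetch the tokenizer and tagger models; resource names can vary slightly between NLTK versions, and the sample sentence and printed tags are illustrative.

import nltk
from nltk.stem import PorterStemmer

# One-time downloads of the tokenizer and tagger models
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

text = "Natural language processing makes computers understand human language."

tokens = nltk.word_tokenize(text)                    # tokenization
tags = nltk.pos_tag(tokens)                          # part-of-speech tagging
stems = [PorterStemmer().stem(t) for t in tokens]    # stemming

print(tags[:3])    # e.g. [('Natural', 'JJ'), ('language', 'NN'), ('processing', 'NN')]
print(stems[:3])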
2. Stanford Core NLP - Data Analysis, Sentiment Analysis, Conversational UI
We can say that the Stanford NLP library is a multi-purpose tool for text analysis. Like NLTK,
Stanford CoreNLP provides many different natural language processing tools. If you need
more, you can use custom modules.
The main advantage of the Stanford NLP tools is scalability. Unlike NLTK, Stanford CoreNLP is a
perfect choice for processing large amounts of data and performing complex operations.
With its high scalability, Stanford CoreNLP is an excellent choice for:
 information scraping from open sources (social media, user-generated reviews)
 sentiment analysis (social media, customer support)
 conversational interfaces (chatbots)
 text processing and generation (customer support, e-commerce)
3. Apache OpenNLP - Data Analysis and Sentiment Analysis
Accessibility is essential when you need a tool for long-term use, which is challenging in the realm
of open-source Natural Language Processing tools: while a tool may be powered with the right
features, it can be too complex to use.
Apache OpenNLP is an open-source library for those who prefer practicality and accessibility. Like
Stanford CoreNLP, it uses Java NLP libraries with Python decorators.
While NLTK and Stanford CoreNLP are state-of-the-art libraries with tons of additions, OpenNLP is a
simple yet useful tool. Besides, you can configure OpenNLP in the way you need and get rid of
unnecessary features.
Apache OpenNLP is the right choice for:
 Named Entity Recognition
 Sentence Detection
 POS tagging
 Tokenization
4. GenSim - Document Analysis, Semantic Search, Data Exploration
Sometimes you need to extract particular information to discover business insights. GenSim is the
perfect tool for such things. It is an open-source NLP library designed for document exploration and
topic modeling, and it helps you navigate various databases and documents.
The key GenSim feature is word vectors. It sees the content of documents as sequences of
vectors and clusters, and then GenSim classifies them. A short Word2Vec example follows the
use-case list below.
GenSim is also resource-efficient when it comes to dealing with a large amount of data.
The main GenSim use cases are:
 Data analysis
 Semantic search applications
 Text generation applications (chatbot, service customization, text summarization, etc.)
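A minimal Gensim Word2Vec example that trains word vectors on a toy corpus. The corpus, vector size, and other parameters are illustrative; note that recent Gensim versions call the dimensionality parameter vector_size (older 3.x versions call it size).

from gensim.models import Word2Vec

# Toy corpus: each document is a list of tokens
sentences = [
    ["deep", "learning", "models", "need", "data"],
    ["gensim", "builds", "word", "vectors", "from", "documents"],
    ["word", "vectors", "cluster", "similar", "words", "together"],
]

# vector_size is the embedding dimensionality (named `size` in Gensim 3.x)
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

print(model.wv["word"].shape)            # (50,)
print(model.wv.most_similar("word", topn=2))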
5. Intel NLP Architect - Data Exploration, Conversational UI
Intel NLP Architect is the newest tool in this list. It is a Python library for deep learning NLP that
uses recurrent neural networks. You can use it for:
 text generation and summarization
 aspect-based sentiment analysis
 and conversational interfaces such as chatbots
