Deep learning
For deep versus shallow learning in educational psychology, see Student approaches to
learning.
Deep learning (also known as deep structured learning, hierarchical learning or deep
machine learning) is a branch of machine learning based on a set of algorithms that
attempt to model high-level abstractions in data. In a simple case, there might be two sets
of neurons: ones that receive an input signal and ones that send an output signal. When
the input layer receives an input it passes on a modified version of the input to the next
layer. In a deep network, there are many layers between the input and output (and the
layers are not made of neurons but it can help to think of it that way), allowing the algorithm
to use multiple processing layers, composed of multiple linear and non-linear
transformations.[1][2][3][4][5][6][7][8][9]
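To make the stacked-transformation picture concrete, here is a minimal sketch in Python/NumPy of an input passing through two layers, each applying a linear map followed by a non-linearity (the layer sizes and the choice of tanh are arbitrary assumptions, not taken from any particular model):

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, W, b, nonlinearity=np.tanh):
    """One processing layer: a linear transformation followed by a non-linearity."""
    return nonlinearity(W @ x + b)

x = rng.normal(size=4)                            # "input layer": the raw input signal
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)     # parameters of a hidden layer (toy sizes)
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)     # parameters of the output layer

h = layer(x, W1, b1)                              # hidden layer receives a modified version of the input
y = layer(h, W2, b2)                              # output layer receives a modified version of h
print(y.shape)                                    # (3,)
```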
Deep learning is part of a broader family of machine learning methods based on learning
representations of data. An observation (e.g., an image) can be represented in many ways
such as a vector of intensity values per pixel, or in a more abstract way as a set of edges,
regions of particular shape, etc. Some representations are better than others at simplifying
the learning task (e.g., face recognition or facial expression recognition[10]). One of the
promises of deep learning is replacing handcrafted features with efficient algorithms
for unsupervised or semi-supervised feature learning and hierarchical feature extraction.[11]
Research in this area attempts to make better representations and create models to learn
these representations from large-scale unlabeled data. Some of the representations are
inspired by advances in neuroscience and are loosely based on interpretation of
information processing and communication patterns in a nervous system, such as neural
coding which attempts to define a relationship between various stimuli and associated
neuronal responses in the brain.[12]
Various deep learning architectures such as deep neural networks, convolutional deep
neural networks, deep belief networks and recurrent neural networks have been applied to
fields like computer vision, automatic speech recognition, natural language processing,
audio recognition and bioinformatics, where they have been shown to produce state-of-the-art results on various tasks.
Deep learning has been characterized as a buzzword, or a rebranding of neural networks.[13][14]
Companies such as Google are giving their employees classes in how to use it.[15]
Introduction
Definitions
Deep learning is characterized as a class of machine learning algorithms for which several definitions have been proposed.[2](pp. 199-200)
These definitions have in common (1) multiple layers of nonlinear processing units and (2)
the supervised or unsupervised learning of feature representations in each layer, with the
layers forming a hierarchy from low-level to high-level features.[2](p. 200) The composition of a
layer of nonlinear processing units used in a deep learning algorithm depends on the
problem to be solved. Layers that have been used in deep learning include hidden layers of
an artificial neural network and sets of complicated propositional formulas.[3] They may also
include latent variables organized layer-wise in deep generative models such as the nodes
in Deep Belief Networks and Deep Boltzmann Machines.
Deep learning algorithms transform their inputs through more layers than shallow learning
algorithms. At each layer, the signal is transformed by a processing unit, like an artificial
neuron, whose parameters are 'learned' through training.[5](p6) A chain of transformations
from input to output is a credit assignment path (CAP). CAPs describe potentially causal
connections between input and output and may vary in length. For a feedforward neural
network, the depth of the CAPs (and thus of the network) is the number of hidden layers plus
one (as the output layer is also parameterized), but for recurrent neural networks, in which
a signal may propagate through a layer more than once, the CAP is potentially unlimited in
length. There is no universally agreed upon threshold of depth dividing shallow learning
from deep learning, but most researchers in the field agree that deep learning has multiple
nonlinear layers (CAP > 2) and Juergen Schmidhuber considers CAP > 10 to be very deep
learning.[5](p7)
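As a small worked example of the depth convention just described (the layer sizes are hypothetical):

```python
# Hypothetical layer sizes for a feedforward network: input, two hidden layers, output.
layers = [784, 256, 64, 10]

hidden_layers = len(layers) - 2        # layers between input and output
cap_depth = hidden_layers + 1          # the output layer is also parameterized
print(cap_depth)                       # 3, i.e. CAP > 2, so "deep" by the usual convention
```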
Fundamental concepts
Deep learning algorithms are based on distributed representations. The underlying
assumption behind distributed representations is that observed data are generated by the
interactions of factors organized in layers. Deep learning adds the assumption that these
layers of factors correspond to levels of abstraction or composition. Varying numbers of
layers and layer sizes can be used to provide different amounts of abstraction. [4]
Deep learning exploits this idea of hierarchical explanatory factors where higher level, more
abstract concepts are learned from the lower level ones. These architectures are often
constructed with a greedy layer-by-layer method. Deep learning helps to disentangle these
abstractions and pick out which features are useful for learning.[4]
For supervised learning tasks, deep learning methods obviate feature engineering, by
translating the data into compact intermediate representations akin to principal
components, and derive layered structures which remove redundancy in representation. [2]
Many deep learning algorithms are applied to unsupervised learning tasks. This is an
important benefit because unlabeled data are usually more abundant than labeled data.
Examples of deep structures that can be trained in an unsupervised manner are neural
history compressors[16] and deep belief networks.[4][17]
Interpretations
Deep neural networks are generally interpreted in terms of the universal approximation theorem[18][19][20][21][22] or probabilistic inference.[2][3][4][5][17][23]
In 1989, the first proof was published by George Cybenko for sigmoid activation
functions[19] and was generalised to feed-forward multi-layer architectures in 1991 by Kurt
Hornik.[20]
Probabilistic interpretation
The probabilistic interpretation[23] derives from the field of machine learning. It features
inference,[2][3][4][5][17][23] as well as the optimization concepts of training and testing, related to
fitting and generalization respectively. More specifically, the probabilistic interpretation
considers the activation nonlinearity as a cumulative distribution function.[23] See Deep belief
network. The probabilistic interpretation led to the introduction of dropout as regularizer in
neural networks.[24]
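As a small illustration of reading the activation non-linearity as a cumulative distribution function: the logistic sigmoid coincides with the CDF of the standard logistic distribution. The following sketch (Python with NumPy and SciPy, chosen here only for the numerical check and not prescribed by the cited sources) verifies this identity:

```python
import numpy as np
from scipy.stats import logistic

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 11)
# The logistic sigmoid is exactly the CDF of the standard logistic distribution,
# which is one way to read an activation non-linearity probabilistically.
assert np.allclose(sigmoid(x), logistic.cdf(x))
```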
The probabilistic interpretation was introduced and popularized by Geoff Hinton, Yoshua
Bengio, Yann LeCun and Juergen Schmidhuber.
History
The first general, working learning algorithm for supervised deep feedforward multilayer
perceptrons was published by Ivakhnenko and Lapa in 1965.[25] A 1971 paper[26] described a
deep network with 8 layers trained by the Group method of data handling algorithm which is
still popular in the current millennium[citation needed]. These ideas were implemented in a computer
identification system "Alpha", which demonstrated the learning process. Other Deep
Learning working architectures, specifically those built from artificial neural networks (ANN),
date back to the Neocognitron introduced by Kunihiko Fukushima in 1980.[27] The ANNs
themselves date back even further. The challenge was how to train networks with multiple
layers. In 1989, Yann LeCun et al. were able to apply the
standard backpropagation algorithm, which had been around as the reverse mode
of automatic differentiation since 1970,[28][29][30][31] to a deep neural network with the purpose of
recognizing handwritten ZIP codes on mail. Despite the success of applying the algorithm,
the time to train the network on this dataset was approximately 3 days, making it
impractical for general use.[32] In 1993, Jürgen Schmidhuber's neural history
compressor[16] implemented as an unsupervised stack of recurrent neural networks (RNNs)
solved a "Very Deep Learning" task[5] that requires more than 1,000 subsequent layers in an
RNN unfolded in time.[33] In 1995, Brendan Frey demonstrated that it was possible to train a
network containing six fully connected layers and several hundred hidden units using
the wake-sleep algorithm, which was co-developed with Peter Dayan and Geoffrey Hinton.[34] However, training took two days.
Many factors contribute to the slow speed, one being the vanishing gradient
problem analyzed in 1991 by Sepp Hochreiter.[35][36]
While by 1991 such neural networks were used for recognizing isolated 2-D hand-written
digits, recognizing 3-D objects was done by matching 2-D images with a handcrafted 3-D
object model. Juyang Weng et al. suggested that a human brain does not use a monolithic
3-D object model, and in 1992 they published Cresceptron,[37][38][39] a method for performing
3-D object recognition directly from cluttered scenes. Cresceptron is a cascade of layers
similar to Neocognitron. But while Neocognitron required a human programmer to hand-merge features, Cresceptron automatically learned an open number of unsupervised
features in each layer, where each feature is represented by a convolution kernel.
Cresceptron also segmented each learned object from a cluttered scene through back-analysis through the network. Max pooling, now often adopted by deep neural networks
(e.g. ImageNet tests), was first used in Cresceptron to reduce the position resolution by a
factor of (2x2) to 1 through the cascade for better generalization. Despite these
advantages, simpler models that use task-specific handcrafted features such as Gabor
filters and support vector machines (SVMs) were a popular choice in the 1990s and 2000s,
because of the computational cost of ANNs at the time, and a great lack of understanding
of how the brain autonomously wires its biological networks.
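Max pooling of the kind just described replaces each 2x2 block of a feature map with its maximum. A minimal NumPy sketch (the 4x4 feature map is a toy assumption, and this is generic max pooling rather than Cresceptron's actual implementation):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Reduce position resolution by a factor of 2x2 -> 1 by taking block-wise maxima."""
    h, w = feature_map.shape
    blocks = feature_map[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

fm = np.arange(16, dtype=float).reshape(4, 4)    # a toy 4x4 feature map
print(max_pool_2x2(fm))                          # 2x2 array of block maxima
```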
In the long history of speech recognition, both shallow and deep learning (e.g., recurrent
nets) of artificial neural networks have been explored for many years. [40][41][42] But these
methods never won over the non-uniform internal-handcrafting Gaussian mixture
model/Hidden Markov model (GMM-HMM) technology based on generative models of
speech trained discriminatively.[43] A number of key difficulties have been methodologically
analyzed, including gradient diminishing[35] and weak temporal correlation structure in the
neural predictive models.[44][45] Additional difficulties were the lack of big training data and
weaker computing power in these early days. Thus, most speech recognition researchers
who understood such barriers moved away from neural nets to pursue generative
modeling. An exception was at SRI International in the late 1990s. Funded by the US
government's NSA and DARPA, SRI conducted research on deep neural networks in
speech and speaker recognition. The speaker recognition team, led by Larry Heck,
achieved the first significant success with deep neural networks in speech processing as
demonstrated in the 1998 NIST (National Institute of Standards and Technology) Speaker
Recognition evaluation and later published in the journal Speech Communication.[46]
While SRI established success with deep neural networks in speaker recognition, they
were unsuccessful in demonstrating similar success in speech recognition. Hinton et al.
and Deng et al. reviewed part of this recent history about how their collaboration with each
other and then with colleagues across four groups (University of Toronto, Microsoft,
Google, and IBM) ignited a renaissance of deep feedforward neural networks in speech
recognition.[47][48][49][50]
Today, however, many aspects of speech recognition have been taken over by a deep
learning method called Long short-term memory (LSTM), a recurrent neural
network published by Sepp Hochreiter & Jürgen Schmidhuber in 1997.[51] LSTM RNNs
avoid the vanishing gradient problem and can learn "Very Deep Learning" tasks[5] that
require memories of events that happened thousands of discrete time steps ago, which is
important for speech. In 2003, LSTM started to become competitive with traditional speech
recognizers on certain tasks.[52] Later it was combined with CTC[53] in stacks of LSTM RNNs.[54]
In 2015, Google's speech recognition reportedly experienced a dramatic performance
jump of 49% through CTC-trained LSTM, which is now available through Google Voice to
all smartphone users,[55] and has become a showcase of deep learning.
The use of the expression "Deep Learning" in the context of Artificial Neural Networks was
introduced by Igor Aizenberg and colleagues in 2000.[56] A Google Ngram chart shows that
the usage of the term has taken off since 2000.[57] In 2006, a
publication by Geoffrey Hinton and Ruslan Salakhutdinov drew additional attention by
showing how a many-layered feedforward neural network could be effectively pre-trained one
layer at a time, treating each layer in turn as an unsupervised restricted Boltzmann
machine, then fine-tuning it using supervised backpropagation.[58] In 1992, Schmidhuber
had already implemented a very similar idea for the more general case of unsupervised
deep hierarchies of recurrent neural networks, and also experimentally shown its benefits
for speeding up supervised learning.[16][59]
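The layer-at-a-time idea can be sketched compactly. The following Python/NumPy code is a rough illustration only: it assumes Bernoulli restricted Boltzmann machines trained with single-step contrastive divergence (CD-1), toy layer sizes and data, and it omits the supervised fine-tuning stage.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, lr=0.1, epochs=10):
    """Train one Bernoulli RBM with CD-1; return its weights and hidden biases."""
    n_visible = data.shape[1]
    W = 0.01 * rng.normal(size=(n_visible, n_hidden))
    b, c = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        v0 = data
        h0 = sigmoid(v0 @ W + c)
        h0_sample = (rng.random(h0.shape) < h0).astype(float)
        v1 = sigmoid(h0_sample @ W.T + b)          # reconstruction of the visible layer
        h1 = sigmoid(v1 @ W + c)
        W += lr * (v0.T @ h0 - v1.T @ h1) / len(data)
        b += lr * (v0 - v1).mean(axis=0)
        c += lr * (h0 - h1).mean(axis=0)
    return W, c

def greedy_pretrain(data, layer_sizes):
    """Pre-train a stack of RBMs one layer at a time, feeding each layer's hidden
    activations to the next; supervised backpropagation fine-tuning would follow."""
    stack, x = [], data
    for n_hidden in layer_sizes:
        W, c = train_rbm(x, n_hidden)
        stack.append((W, c))
        x = sigmoid(x @ W + c)                     # representation used by the next layer
    return stack

toy_data = (rng.random((100, 20)) < 0.3).astype(float)   # hypothetical binary data
stack = greedy_pretrain(toy_data, [16, 8])
```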
Since its resurgence, deep learning has become part of many state-of-the-art systems in
various disciplines, particularly computer vision and automatic speech recognition (ASR).
Results on commonly used evaluation sets such as TIMIT (ASR) and MNIST (image
classification), as well as a range of large-vocabulary speech recognition tasks are
constantly being improved with new applications of deep learning.[47][60][61] Recently, it was
shown that deep learning architectures in the form of convolutional neural networks have
been nearly best performing;[62][63] however, these are more widely used in computer vision
than in ASR, and modern large scale speech recognition is typically based on CTC [53] for
LSTM.[51][55][64][65][66]
The real impact of deep learning in industry apparently began in the early 2000s, when,
according to Yann LeCun, CNNs already processed an estimated 10% to 20% of all the
checks written in the US.[67] Industrial applications of deep learning to
large-scale speech recognition started around 2010. In late 2009, Li Deng invited Geoffrey
Hinton to work with him and colleagues at Microsoft Research in Redmond, Washington to
apply deep learning to speech recognition. They co-organized the 2009 NIPS Workshop on
Deep Learning for Speech Recognition. The workshop was motivated by the limitations of
deep generative models of speech, and the possibility that the big-compute, big-data era
warranted a serious try of deep neural nets (DNN). It was believed that pre-training DNNs
using generative models of deep belief nets (DBN) would overcome the main difficulties of
neural nets encountered in the 1990s.[49] However, early in this research at Microsoft, it
was discovered that, even without pre-training, DNNs trained on large amounts of data and
designed with correspondingly large, context-dependent output layers produced error rates
dramatically lower than the then-state-of-the-art GMM-HMM systems and also lower than those of
more advanced generative model-based speech recognition systems. This finding was
verified by several other major speech recognition research groups. [47][68] Further, the nature
of recognition errors produced by the two types of systems was found to be
characteristically different,[48][69] offering technical insights into how to integrate deep learning
into the existing highly efficient, run-time speech decoding system deployed by all major
players in the speech recognition industry. The history of this significant development in deep
learning has been described and analyzed in recent books and articles. [2][70][71]
Advances in hardware have also been important in enabling the renewed interest in deep
learning. In particular, powerful graphics processing units (GPUs) are well-suited for the
kind of number crunching and matrix/vector math involved in machine learning.[72][73] GPUs have
been shown to speed up training algorithms by orders of magnitude, bringing running times
of weeks back to days.[74][75]
sampling down the model (an "ancestral pass") from the top level feature activations.[89]
Hinton reports that his models are effective feature extractors over high-dimensional,
structured data.[90]
In 2012, the Google Brain team led by Andrew Ng and Jeff Dean created a neural network
that learned to recognize higher-level concepts, such as cats, only from watching unlabeled
images taken from YouTube videos.[91][92]
Other methods rely on the sheer processing power of modern computers, in
particular, GPUs. In 2010, Dan Ciresan and colleagues[74] in Jürgen Schmidhuber's group at
the Swiss AI Lab IDSIA showed that despite the above-mentioned "vanishing gradient
problem," the superior processing power of GPUs makes plain back-propagation feasible
for deep feedforward neural networks with many layers. The method outperformed all other
machine learning techniques on the old, famous MNIST handwritten digits problem of Yann
LeCun and colleagues at NYU.
At about the same time, in late 2009, deep learning feedforward networks made inroads
into speech recognition, as marked by the NIPS Workshop on Deep Learning for Speech
Recognition. Intensive collaborative work between Microsoft Research and University of
Toronto researchers demonstrated by mid-2010 in Redmond that deep neural networks
interfaced with a hidden Markov model with context-dependent states that define the neural
network output layer can drastically reduce errors in large-vocabulary speech recognition
tasks such as voice search. The same deep neural net model was shown to scale up to
Switchboard tasks about one year later at Microsoft Research Asia. Even earlier, in 2007,
LSTM[51] trained by CTC[53] started to get excellent results in certain applications. [54] This
method is now widely used, for example, in Google's greatly improved speech recognition
for all smartphone users.[55]
As of 2011, the state of the art in deep learning feedforward networks alternates
convolutional layers and max-pooling layers,[93][94] topped by several fully connected or
sparsely connected layers followed by a final classification layer. Training is usually done
without any unsupervised pre-training. Since 2011, GPU-based implementations [93] of this
approach won many pattern recognition contests, including the IJCNN 2011 Traffic Sign
Recognition Competition,[95] the ISBI 2012 Segmentation of neuronal structures in EM
stacks challenge,[96] the ImageNet Competition,[97] and others.
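As an illustration of the architecture pattern described above (alternating convolutional and max-pooling layers topped by fully connected layers and a classification layer), here is a minimal sketch in PyTorch; the framework, channel counts, 28x28 input size and ten classes are assumptions for the example, not details of the contest-winning systems:

```python
import torch
from torch import nn

# Alternate convolutional and max-pooling layers, then fully connected layers
# and a final classification layer, trained end to end without pre-training.
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 128), nn.ReLU(),
    nn.Linear(128, 10),                      # classification layer (10 hypothetical classes)
)

x = torch.randn(8, 1, 28, 28)                # a toy batch of 28x28 single-channel images
logits = model(x)
print(logits.shape)                          # torch.Size([8, 10])
```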
Such supervised deep learning methods also were the first artificial pattern recognizers to
achieve human-competitive performance on certain tasks.[98]
To overcome the barriers of weak AI represented by deep learning, it is necessary to go
beyond deep learning architectures, because biological brains use both shallow and deep
circuits as reported by brain anatomy[99] displaying a wide variety of invariance.
Weng[100] argued that the brain self-wires largely according to signal statistics and, therefore,
a serial cascade cannot catch all major statistical dependencies. ANNs were able to
guarantee shift invariance to deal with small and large natural objects in large cluttered
scenes, only when invariance extended beyond shift, to all ANN-learned concepts, such as
location, type (object class label), scale, lighting. This was realized in Developmental
Networks (DNs)[101] whose embodiments are Where-What Networks, WWN-1 (2008)[102] through WWN-7 (2013).[103]
There are a huge number of variants of deep architectures. Most of them are branches of
some original parent architecture. It is not always possible to compare the performance of
multiple architectures all together, because they are not all evaluated on the same data
sets. Deep learning is a fast-growing field, and new architectures, variants, or algorithms
appear every few weeks.
A deep neural network can be discriminatively trained with the standard backpropagation
algorithm, with the weights updated via stochastic gradient descent:

$w_{ij}(t+1) = w_{ij}(t) - \eta \frac{\partial C}{\partial w_{ij}} + \xi(t)$

Here, $\eta$ is the learning rate, $C$ is the cost function and $\xi(t)$ a stochastic term. The choice of the
cost function depends on factors such as the learning type (supervised,
unsupervised, reinforcement, etc.) and the activation function. For example, when
performing supervised learning on a multiclass classification problem, common choices
for the activation function and cost function are the softmax function and the cross-entropy
function, respectively. The softmax function is defined as $p_j = \exp(x_j) / \sum_k \exp(x_k)$, where $p_j$ represents
the class probability (the output of unit $j$) and $x_j$ and $x_k$ represent the total inputs to
units $j$ and $k$ of the same level, respectively. Cross entropy is defined
as $C = -\sum_j d_j \log(p_j)$, where $d_j$ represents the target probability for output unit $j$ and $p_j$ is the probability output
for unit $j$ after applying the activation function.[119]
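As a minimal illustration of these formulas (a sketch in Python/NumPy; the layer sizes, learning rate and random data are assumptions made only for the example), the following computes the softmax over the total inputs, the cross-entropy cost, and one stochastic-gradient weight update:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    """p_j = exp(x_j) / sum_k exp(x_k), computed stably."""
    e = np.exp(x - x.max())
    return e / e.sum()

def cross_entropy(d, p):
    """C = -sum_j d_j * log(p_j) for target probabilities d and outputs p."""
    return -np.sum(d * np.log(p))

# One gradient step for a single linear layer with a softmax output.
W = rng.normal(size=(3, 5)) * 0.1       # weights w_ij (3 classes, 5 inputs)
v = rng.normal(size=5)                  # input vector
d = np.array([0.0, 1.0, 0.0])           # one-hot target probabilities

x = W @ v                               # total input x_j to each output unit
p = softmax(x)
print(cross_entropy(d, p))

# For softmax with cross entropy, dC/dx_j = p_j - d_j, so dC/dw_ij = (p_i - d_i) * v_j.
eta = 0.5                               # learning rate
grad_W = np.outer(p - d, v)
W = W - eta * grad_W                    # w_ij(t+1) = w_ij(t) - eta * dC/dw_ij (stochastic term omitted)
```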
These can be used to output object bounding boxes in the form of a binary mask. They
are also used for multi-scale regression to increase localization precision. DNN-based
regression can learn features that capture geometric information in addition to being a
good classifier. They remove the limitation of designing a model which will capture
parts and their relations explicitly. This helps to learn a wide variety of objects. The
model consists of multiple layers, each of which has a rectified linear unit for non-linear
transformation. Some layers are convolutional, while others are fully connected. Every
convolutional layer has an additional max pooling. The network is trained to minimize
L2 error for predicting the mask ranging over the entire training set containing bounding
boxes represented as masks.
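The following is a minimal sketch of such a mask regressor, written with PyTorch purely for illustration; the layer widths, input size, mask resolution and the sigmoid on the output are assumptions rather than details given above. It alternates convolution, ReLU and max pooling, ends with fully connected layers, and minimizes L2 (mean squared) error against bounding boxes encoded as binary masks:

```python
import torch
from torch import nn

MASK = 16                                          # hypothetical output mask resolution

# Convolutional layers (each with ReLU and max pooling), then fully connected
# layers that regress a binary mask; trained by minimizing L2 error.
net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 256), nn.ReLU(),
    nn.Linear(256, MASK * MASK), nn.Sigmoid(),     # per-pixel mask values in [0, 1]
)

images = torch.randn(4, 3, 64, 64)                      # toy RGB images
masks = torch.randint(0, 2, (4, MASK * MASK)).float()   # bounding boxes encoded as binary masks

opt = torch.optim.SGD(net.parameters(), lr=0.01)
loss = nn.MSELoss()(net(images), masks)            # L2 error over the training batch
loss.backward()
opt.step()
```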
Problems with deep neural networks
As with ANNs, many issues can arise with DNNs if they are naively trained. Two
common issues are overfitting and computation time.
DNNs are prone to overfitting because of the added layers of abstraction, which allow
them to model rare dependencies in the training data. Regularization methods such as
Ivakhnenko's unit pruning,[26] weight decay (L2 regularization), or sparsity (L1 regularization) can be applied during training to help combat overfitting.[120] A more
recent regularization method applied to DNNs is dropout regularization. In dropout,
some number of units are randomly omitted from the hidden layers during training. This
helps to break the rare dependencies that can occur in the training data. [121]
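A minimal sketch of dropout as described above, in Python/NumPy (the drop probability and layer size are arbitrary, and the rescaling by 1/(1-p) is the common "inverted dropout" convention rather than a detail from the cited work):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_drop=0.5, training=True):
    """Randomly omit hidden units during training; rescale to keep the expected activation."""
    if not training:
        return h                              # at test time, all units are kept
    keep = (rng.random(h.shape) >= p_drop).astype(h.dtype)
    return h * keep / (1.0 - p_drop)

hidden = rng.normal(size=(4, 10))             # toy activations of a hidden layer
print(dropout(hidden))                        # roughly half of the units are zeroed out
```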
The dominant method for training these structures has been error-correction training
(such as backpropagation with gradient descent) due to its ease of implementation and
its tendency to converge to better local optima than other training methods. However,
these methods can be computationally expensive, especially for DNNs. There are
many training parameters to be considered with a DNN, such as the size (number of
layers and number of units per layer), the learning rate and initial weights. Sweeping
through the parameter space for optimal parameters may not be feasible due to the
cost in time and computational resources. Various 'tricks' such as using mini-batching
(computing the gradient on several training examples at once rather than individual
examples)[122] have been shown to speed up computation. The large processing
throughput of GPUs has produced significant speedups in training, because the required
matrix and vector computations are well suited to GPUs.[5] Radical alternatives to
backprop such as Extreme Learning Machines,[123] "No-prop" networks,[124] training
without backtracking,[125] "weightless" networks,[126] and non-connectionist neural
networks are gaining attention.
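To make the mini-batching trick mentioned above concrete, here is a minimal sketch in Python/NumPy that computes the gradient over several training examples at once for a plain linear least-squares model (the model, batch size, learning rate and synthetic data are assumptions made only for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: y = X @ w_true + noise.
X = rng.normal(size=(1000, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.01 * rng.normal(size=1000)

w = np.zeros(10)
lr, batch_size = 0.1, 32

for epoch in range(20):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        # Gradient of the mean squared error over the whole mini-batch at once,
        # rather than over individual examples.
        grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)
        w -= lr * grad

print(np.allclose(w, w_true, atol=0.05))      # the fit should recover w_true closely
```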
unpredictable by the automatizer, the automatizer is forced in the next learning phase
to predict or imitate through special additional units the hidden units of the more slowly
changing chunker. This makes it easy for the automatizer to learn appropriate, rarely
changing memories across very long time intervals. This in turn helps the automatizer
to make many of its once unpredictable inputs predictable, such that the chunker can
focus on the remaining still unpredictable events, to compress the data even further.[16]