
Springer Series in Bio-/Neuroinformatics 2

Hanspeter A. Mallot

Computational
Neuroscience
A First Course
Springer Series in Bio-/Neuroinformatics

Volume 2

Series Editor
N. Kasabov

For further volumes:


http://www.springer.com/series/10088
Hanspeter A. Mallot

Computational Neuroscience
A First Course

Hanspeter A. Mallot
Department of Biology
University of Tübingen
Tübingen
Germany

ISSN 2193-9349 ISSN 2193-9357 (electronic)


ISBN 978-3-319-00860-8 ISBN 978-3-319-00861-5 (eBook)
DOI 10.1007/978-3-319-00861-5
Springer Cham Heidelberg New York Dordrecht London

Library of Congress Control Number: 2013939742

© Springer International Publishing Switzerland 2013


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection
with reviews or scholarly analysis or material supplied specifically for the purpose of being entered
and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of
this publication or parts thereof is permitted only under the provisions of the Copyright Law of the
Publisher’s location, in its current version, and permission for use must always be obtained from Springer.
Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations
are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of pub-
lication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any
errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect
to the material contained herein.

Printed on acid-free paper


Springer is part of Springer Science+Business Media (www.springer.com)
Foreword

Computational approaches have been an ever increasing trend in neuroscience research since the publication of Norbert Wiener’s seminal book in 1948. Already in
its title, Cybernetics: or control and communication in the animal and the machine,
this book makes a central claim which is valid to this day: Information processing
in biology and technology is based on the same principles and procedures, at least
if an appropriate level of description is selected. This “essential unity of problems
... whether in the machine or in living tissue” (p. 11 of the 1961 edition) implies
two research directions, i.e. the application of mathematics and systems theory to
the analysis of neural systems and, reversely, the application of neural procedures to
the solution of technical problems. The first approach proved to be very successful.
The analysis of sensors and sensory processing, motor control, and the principles of
learning and adaptation in modular systems are now well understood and parame-
terized, largely based on available mathematical tools. The opposite approach went
through ups and downs. In the attempt to exploit the alleged superiority of neural
systems, subproblems such as pattern recognition, localization, or learning in neural
nets were singled out and their respective neural substrates were analyzed—often
with disappointing results: whenever a well conditioned objective function could
be defined, optimal realizations by means of technical algorithms and procedures
turned out to be more efficient. More recently interest has shifted to machine learn-
ing and statistical learning theory, keeping only loose connections to neuroscience
and biology. There have, however, been attempts to make neuroscience and compu-
tation re-converge, mostly by means of a unified theoretical treatment of the various
subproblems mentioned above. The present book is part of this endeavor.
Good scientific books pave a way of thinking which makes concepts understand-
able, quantifies the facts, and puts them in adequate perspective. Professor Mallot
has done just this. Starting with the intricate electrochemical properties of single
cells, he proceeds to the connectivity of neurons. Their space-time dependent re-
sponse properties represent the essence of all neural systems. The learning depen-
dent variations of these nets are also treated as an essential aspect. The associative
memory is one example of this property. To understand the coupling of real spa-
tially two-dimensional structures, one has to include some additional aspects like
population coding or the geometric and parametric mappings that facilitate compu-
tations in two-dimensional representations. The application-oriented claim of cyber-
netics is treated in a chapter on artificial neural nets. The author, who has also published
important contributions to the theoretical description of neural systems, has
the rare talent of putting facts and their relations into mathematical terms in a way that
is easily understood. Also, the handling of the problems with continuous as well
as with discrete methods and the introduction of necessary non-linearities into the
description can be captured without special training.
Recent progress in experimental methods has provided exciting new insights into
the structure of the brain on all levels of complexity and elucidated the efficiency
and the variability of the systems. Theoretical concepts have become more profound
and more adequate for dealing with biological constraints. Self-organization and
life-long adaptation of the brain proved to determine both its anatomical structures
and the way data and knowledge are represented. The underlying processes—and
with that the decomposition of tasks—seem to be based on a relatively uniform
set of learning procedures and anatomical architectures (such as cortices), both fa-
voring the flexibility of the entire systems. Thus, brains and the information they
contain are in a permanent process of adaptation to their current goal. This process
of self-organization seems to be the one driving force of both brain development
and learning.
The self-organization of neural systems is of course of substantial interest for
technical systems with their ever-increasing complexity and diversity. Self-organiza-
tion of neuron-like structures based on their innate flexibility may lead a way to
reduced design efforts. In this sense, the original quest of cybernetics is still alive,
albeit with an extended focus on the whole, behaving organism: How do neural
systems succeed in autonomously overcoming their ignorance and exploiting the pre-
dictabilities of their environment for optimal behavior and, indeed, for evolutionary
success? Evolution and ecology prove, of course, that solutions to these questions
exist.
Not surprisingly, the computational methods used in the neural and behavioral
sciences cover the major part of scientific computing at large. The book focuses on
the most important approaches and lays the foundations for extended questions and
specifications. It shows that it is not only the detailed mathematical formalisms that
are crucial but also the broader concepts which allow appropriate mathematizations
of neuroscientific phenomena and concepts. This book delivers a clear and under-
standable introduction into the structure of the problems of a scientific domain that
is still among the most interesting fields science has to offer.

Bochum and Mainz, April 2013 Werner von Seelen


Institute for Neural Computation
Ruhr-University Bochum
Preface

Mathematical modeling in neuroscience is by now a common tool in many research projects and laboratories; its coverage in neuroscience curricula is therefore increas-
ing. At the same time, computational neuroscience is still largely considered a do-
main of specialists who have acquired a mathematical background in physics or
engineering and subsequently came to apply these ideas to questions of the physi-
ology of neurons and neural networks and of the information processing carried out
by these structures.
This book attempts an inverse approach. It grew out of a lecture series delivered
over a period of more than ten years to graduate students in neuroscience, most
of whom had backgrounds in biology, medicine, psychology, or cognitive science.
The required mathematical prerequisites for the course were therefore limited to
the level of a bachelor degree in science, i.e. to basic calculus and linear algebra.
All mathematical issues beyond this level—such as differential equations, convolu-
tion, complex numbers, high-dimensional vector spaces, or the statistical informa-
tion measure—are thoroughly motivated and explained as they arise. I tried to keep
these explanations mathematically clean but in some cases had to omit subtleties which a
full mathematical treatment ought to provide. The text reflects extensive class-room
discussions and presents routes of explanation accessible also to mathematical non-
experts. This is particularly true for the chapter on Fourier transforms which is the
least neuroscientific one in the book. It is, however, instrumental for a deeper under-
standing of receptive fields and visual information processing.
The book also assumes some familiarity with the basic facts of neuroscience; if
problems arise, any textbook of neuroscience will help.
Computational methods are applicable and useful in all fields of neuroscience. In
the history of the field, three main roots can be identified, i.e. membrane biophysics,
linear and non-linear systems theory, and machine learning. In the biophysics of
neural membranes, modeling and experimentation have developed jointly from the
beginnings in the 19th century. Membrane potentials are readily treated with the
methods of electrodynamics and electrochemistry, and computational neuroscience
strongly draws on these disciplines. The theory of receptive fields, developed since
the 1940s, uses concepts from two sources, quantum mechanics and “signals and
systems” in electrical engineering which may be summarized as systems theory.
Since the advent of computer vision, visual neuroscience has in turn contributed
important concepts to the computational sciences, including neighborhood opera-
tions, feature extraction, and scale space. Artificial neural networks form the most
recent addition to theorizing in the neurosciences. They initially applied concepts
from linear algebra and multivariate statistics combined with an explorative use of
computer algorithms. More recently, artificial neural networks became a part of the
booming field of machine learning, which is not per se interested in the neurobio-
logical plausibility of its algorithms. Still, it offers a wealth of theoretical concepts
that neuroscientists can make use of. The book focuses on these three domains of
computational neuroscience. Higher level topics such as memory and representa-
tion, decision making, reasoning, etc., i.e. the whole field of cognitive modeling
could not be included in this volume.
The lecture course was accompanied by a journal club in which related classical
and recent neuroscience publications were presented and discussed. The “suggested
reading” sections refer to a number of papers used in these seminars. They are rec-
ommended for a deeper study of the subject.
Finally, I would like to thank my students for the many lively discussions in class
and during the preparation of the seminar papers. They have helped me to constantly
rethink and improve my presentation.

Tübingen, April 2013 Hanspeter A. Mallot


Contents

1 Excitable Membranes and Neural Conduction
   1.1 Membrane Potentials
   1.2 The Hodgkin-Huxley Theory
       1.2.1 Modeling Conductance Change with Differential Equations
       1.2.2 The Potassium Channel
       1.2.3 The Sodium Channel
       1.2.4 Combining the Conductances in Space Clamp
   1.3 An Analytical Approximation: The FitzHugh-Nagumo Equations
   1.4 Passive Conduction
   1.5 Propagating Action Potentials
   1.6 Summary and Outlook
   1.7 Suggested Reading

2 Receptive Fields and the Specificity of Neuronal Firing
   2.1 Spatial Summation
       2.1.1 Correlation and Linear Spatial Summation
       2.1.2 Lateral Inhibition: Convolution
       2.1.3 Correlation and Convolution
       2.1.4 Spatio-Temporal Summation
       2.1.5 Peri-Stimulus Time Histogram (PSTH) and Tuning Curves
   2.2 Functional Descriptions of Receptive Fields
       2.2.1 Isotropic Profiles: Gaussians
       2.2.2 Orientation: Gabor Functions
       2.2.3 Spatio-Temporal Gabor Functions
       2.2.4 Why Gaussians?
   2.3 Non-linearities in Receptive Fields
       2.3.1 Linearity Defined: The Superposition Principle
       2.3.2 Static Non-linearity
       2.3.3 Non-linearity as Interaction: Volterra Kernels
       2.3.4 Energy-Type Non-linearity
       2.3.5 Summary: Receptive Fields in the Primary Visual Pathway
   2.4 Motion Detection
       2.4.1 Motion and Flicker
       2.4.2 Coincidence Detector
       2.4.3 Correlation Detector
       2.4.4 Motion as Orientation in Space-Time
   2.5 Suggested Reading

3 Fourier Analysis for Neuroscientists
   3.1 Examples
       3.1.1 Light Spectra
       3.1.2 Acoustics
       3.1.3 Vision
       3.1.4 Magnetic Resonance Tomography
   3.2 Why Are Sinusoidals Special?
       3.2.1 The Eigenfunctions of Convolution: Real Notation
       3.2.2 Complex Numbers
       3.2.3 The Eigenfunctions of Convolution: Complex Notation
       3.2.4 Gaussian Convolution Kernels
   3.3 Fourier Decomposition: Basic Theory
       3.3.1 Periodic Functions
       3.3.2 The Convolution Theorem; Low-Pass and High-Pass
       3.3.3 Finding the Coefficients
   3.4 Fourier Decomposition: Generalizations
       3.4.1 Non-periodic Functions
       3.4.2 Fourier-Transforms in Two and More Dimensions
   3.5 Summary: Facts on Fourier Transforms
   3.6 Suggested Reading

4 Artificial Neural Networks
   4.1 Elements of Neural Networks
       4.1.1 Activity and the States of a Neural Network
       4.1.2 Activation Function and Synaptic Weights
       4.1.3 The Dot Product
       4.1.4 Matrix Operations
       4.1.5 Weight Dynamics (“Learning Rules”)
   4.2 Classification
       4.2.1 The Perceptron
       4.2.2 Linear Classification
       4.2.3 Limitations
       4.2.4 Supervised Learning and Error Minimization
       4.2.5 Support Vector Machines
   4.3 Associative Memory
       4.3.1 Topology: the Feed-Forward Associator
       4.3.2 Example: A 2 × 3 Associator
       4.3.3 Associative Memory and Covariance Matrices
       4.3.4 General Least Square Solution
       4.3.5 Applications
   4.4 Self-organization and Competitive Learning
       4.4.1 The Oja Learning Rule
       4.4.2 Self-organizing Feature Map (Kohonen Map)
   4.5 Suggested Reading

5 Coding and Representation
   5.1 Population Code
       5.1.1 Types of Neural Codes
       5.1.2 Information Content of Population Codes
       5.1.3 Reading a Population Code: The Center of Gravity Estimator
       5.1.4 Examples, and Further Properties
       5.1.5 Summary
   5.2 Retinotopic Mapping
       5.2.1 Areal Magnification
       5.2.2 Conformal Maps
       5.2.3 Log-Polar Mapping
   5.3 Suggested Reading

References

Index
Chapter 1
Excitable Membranes and Neural Conduction

Abstract. Neural information processing is based on three cellular mechanisms, i.e., the excitability of neural membranes, the spatio-temporal integration of activities on
dendritic trees, and synaptic transmission. The basic element of neural activity is the
action potential, which is a binary event, being either present or absent, much as the
electrical signals in digital circuit technology. In this chapter, we discuss the forma-
tion of the action potential as a result of the dynamics of electrical and chemical
processes in the neural membrane. In order to infer the closed loop dynamics from
the individual processes of voltage sensitive channels and the resulting resistive and
capacitive currents, a mathematical theory is needed, known as the Hodgkin-Huxley
theory. The propagation of neural signals along axons and dendrites is based on the
cable equation which is also discussed in this chapter. Mathematical background is
mostly from the theory of dynamical systems.

1.1 Membrane Potentials


In this section, we review the basic facts from membrane neurophysiology as a back-
ground to the mathematical models discussed later. As a starting point, we consider

Table 1.1 Ion concentrations and Nernst potentials of three ion types (from Aidley, 1998)

               Ion   External (mM)   Internal (mM)   Nernst potential (mV)
Frog muscle    K+    2.25            124             −101
               Na+   109             10.4            +59
               Cl−   77.5            1.5             −99
Squid axon     K+    20              400             −75
               Na+   440             50              +55
               Cl−   560             40              −66


[Figure 1.1: circuit diagram of a membrane patch; the capacitance Cm lies in parallel with three branches, the variable conductances gNa and gK and the fixed leak conductance gl, each in series with its battery ENa, EK, or El, connected between the outside and the inside of the membrane.]

Fig. 1.1 Electrical circuit representing the membrane of a neuron. Redrawn from Hodgkin &
Huxley (1952)

the resting potential of nerve cells, i.e. the voltage difference of about −70 mV (in-
side negative) across a neuron’s cellular membrane. This voltage difference is due
to an unequal distribution of various ions, including the cations of sodium (Na+ )1 ,
potassium (K+ )2 , and calcium (Ca2+ ), as well as the anion chloride (Cl− ) and the
anions formed by acidic proteins (i.e. X-COO− ). While the latter cannot permeate
the cell membrane, the small inorganic ions can do so to various degrees. In fact, the
membrane contains special protein complexes, the channels, which allow the ions to
pass. These channels are generally specific for particular ion types and their open-
ing may depend on the presence or absence of specific chemical “ligands” (such as
synaptic transmitters), or on the membrane potential. Indeed, voltage dependence
of ion channel opening is the key property of neuron membranes from which the
excitability of neurons results. It is thus the basis of neural information processing.
In the steady state, i.e. if the neuron is not active, the permeability of the mem-
brane for each ion sort is constant. In this situation the ion distribution across the
membrane and the membrane potential are related by the Nernst3 and Goldman4
equations. If only one ion sort is able to permeate the membrane, the Nernst equi-
librium is obtained if the osmotic force, resulting from unequal distribution of ions
across the membrane, just cancels the electro-motor force, resulting from the un-
equal distribution of electric charges. Consider a hypothetical membrane separating

1 The chemical symbol Na refers to the German “Natrium”, named after Lake Natron in Tanzania,
which is known for its high concentration of Na+ ions.
2 The chemical symbol K refers to the German “Kalium”. It is derived from the Arabic “al-qalya”,

which refers to plant ashes in which potassium was first discovered.


3 Walther Nernst (1864 – 1941). German physicist and chemist. Nobel Prize in Chemistry 1920.
4 David Eliot Goldman (1910 – 1998). United States physicist.

two compartments, one of which is filled with pure water and the other with the
solution of a potassium salt of an acidic protein. If the membrane is permeable to
potassium cations, but not to the large anions, potassium cations will cross the mem-
brane following osmotic force, i.e. by diffusion. Since the anions cannot follow, this
leads to a depletion of positive charges, i.e. a surplus of negative charges which
drives the cations back into their original compartment. The eventual equilibrium
where both forces cancel out is given by Nernst’s equation. In a realistic neuron,
the potassium equilibrium potential is on the order of −100 mV. Similarly, a Nernst
potential of +60 mV can be calculated for the Na+ ions, see Table 1.1. Note that the
membrane potential is measured as the difference inside minus outside, -100 mV
therefore means that the inside is negative. The Nernst potential marks a thermody-
namical equilibrium, i.e. a steady state which is stable without requiring energy for
its maintenance. If the membrane potential is changed (e.g. by putting electrodes
into the compartments and applying an external voltage), the electro-motor force
changes and ions will move until a new equilibrium is reached.
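
As a quick check of the values in Table 1.1, the Nernst potential can be computed directly from the ion concentrations. The short Python sketch below uses the standard Nernst equation E = (RT/zF) ln(c_out/c_in); the temperature of 293 K (about 20 degrees Celsius) is an assumed value, so small deviations from the tabulated potentials are expected.

```python
import math

R = 8.314      # gas constant, J / (mol K)
F = 96485.0    # Faraday constant, C / mol
T = 293.0      # absolute temperature, K (assumed, about 20 degrees C)

def nernst(c_out, c_in, z):
    """Nernst potential in mV for valence z and concentrations in mM."""
    return 1000.0 * R * T / (z * F) * math.log(c_out / c_in)

# squid axon values from Table 1.1: (external, internal, valence)
ions = {"K+": (20.0, 400.0, +1), "Na+": (440.0, 50.0, +1), "Cl-": (560.0, 40.0, -1)}

for name, (c_out, c_in, z) in ions.items():
    print(f"{name:3s}  {nernst(c_out, c_in, z):+5.0f} mV")
```
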
If ions of more than one type can cross the membrane, the equilibrium is more
complex, since the electro-motor forces are the same for all ion sorts with the same
charge, whereas the osmotic forces depend on the concentrations of each ion sort
separately. In effect, a complex mixture of the individual Nernst equilibria is ob-
tained, described by the Goldman equation (see textbooks of physiology for details).
The resting potential of a typical neuron is determined by the concentrations of K+ ,
Na+ , and Cl− ions and their respective equilibrium (Nernst) potentials, according
to Goldman’s equation. It is not at a thermodynamical equilibrium; for its mainte-
nance sodium ions are constantly pumped outwards and potassium ions inwards by
an active (i.e. energy consuming) process, the sodium-potassium-pump located in
the cell membrane.
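
A minimal sketch of the Goldman equation can make this concrete. The concentrations are taken from Table 1.1, but the relative permeabilities PK : PNa : PCl = 1 : 0.04 : 0.45 for the resting squid axon are assumed textbook values not given here; with them, the computed resting potential comes out near the −70 mV quoted above.

```python
import math

R, F, T = 8.314, 96485.0, 293.0   # gas constant, Faraday constant, temperature (K)

def goldman(P_K, P_Na, P_Cl, K_o, K_i, Na_o, Na_i, Cl_o, Cl_i):
    """Goldman equation in mV; for the anion Cl- the inside and
    outside concentrations appear exchanged."""
    num = P_K * K_o + P_Na * Na_o + P_Cl * Cl_i
    den = P_K * K_i + P_Na * Na_i + P_Cl * Cl_o
    return 1000.0 * R * T / F * math.log(num / den)

# squid-axon concentrations (mM) from Table 1.1; the permeability ratios
# P_K : P_Na : P_Cl = 1 : 0.04 : 0.45 are assumed textbook values
V_rest = goldman(1.0, 0.04, 0.45, 20, 400, 440, 50, 560, 40)
print(f"resting potential: {V_rest:.0f} mV")   # about -62 mV
```
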
The electrical properties of a small patch of membrane can be modeled by the
simple electrical circuit shown in Figure 1.1. The distributions of the sodium, potas-
sium and chloride ions act as batteries generating an ion current across the mem-
brane. The conductivity of the membrane for the two ions is denoted by gNa and
gK , respectively. The arrows drawn across the resistors indicate that conductivity
can change. This change of sodium and potassium conductivity as a function of the
membrane potential is the basis of the excitability of neural membranes. The third
channel is not voltage dependent and is called “leakage”; its current is mostly due
to chloride ions. Finally, the membrane has a capacitance denoted by Cm .
The action potential is generated by the different dynamics of the changes in
membrane conductivity for the different ions. An initial depolarization causes the
sodium channel to open and sodium ions will come in, leading to a still stronger
depolarization of the membrane. After about 1 ms, the sodium channel inactivates
(i.e. closes) in a process depending on intrinsic dynamics, but not on membrane
potential. As an effect of depolarization, the potassium channels will open as well,
however, with a slower time course. When they open, potassium will move out and
the membrane is repolarized. As an effect of repolarization, the potassium chan-
nels close again. When the action potential is over, the ion distributions in the cell
and its environment will differ slightly less than before, due to the ion currents involved.

[Figure 1.2: schematic membrane with Na+ ions outside and K+ ions inside, shown in three stages: resting state, depolarisation, and repolarisation; the direction of propagation is from right to left.]

Fig. 1.2 Changes in conductivity of an excitable membrane during the course of an action
potential. Initially (left side), the depolarization traveling passively along the axon will lead
to an opening of the Na+ channels. This opening is transient and finishes after about 1
ms independent of membrane potential. As a result, the membrane potential will rise quickly
(depolarization). With a certain delay, the potassium channels will open. Due to the
reverse distribution patterns of the two ions, this will lead to an outflow of positive charge,
i.e. to repolarization. This in turn leads to the closure of the potassium channels. The length
of the arrows symbolizes the conductivities gNa and gK (cf. Figure 1.8).

The number of ions passing during each action potential can be calculated from
the membrane capacity, the voltage change, and the elementary charge. Roughly,
about 0.1% of the ions are exchanged during one action potential. In the long run,
these differences are compensated for by the sodium-potassium-pump. A schematic
picture of the opening and closing events is given in Fig. 1.2.
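
A back-of-the-envelope version of this calculation can be written in a few lines of Python. The specific membrane capacitance (1 µF/cm²), the spike amplitude (100 mV), and the fiber radius are assumed values not stated in the text; note that the exchanged fraction scales inversely with fiber radius and approaches the quoted 0.1% only for thin fibers.

```python
C_m = 1e-6        # membrane capacitance per area, F/cm^2 (assumed typical value)
dV = 0.1          # voltage excursion of the action potential, V (about 100 mV)
e = 1.602e-19     # elementary charge, C
N_A = 6.022e23    # Avogadro constant, 1/mol

ions_moved = C_m * dV / e        # ions per cm^2 of membrane needed to recharge it

conc = 400e-3     # internal K+ concentration, mol/l (squid axon, Table 1.1)
radius = 1e-5     # fiber radius, cm (0.1 um; assumed thin fiber)
ions_inside = conc / 1000.0 * N_A * radius / 2.0   # per cm^2 (cylinder: volume/area = r/2)

print(f"fraction exchanged per spike: {ions_moved / ions_inside:.1e}")  # about 5e-4
```
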

1.2 The Hodgkin-Huxley Theory


One use of mathematical theory in neuroscience is its capacity to predict the re-
sults of quantitative measurements from other such measurements. If successful,
this prediction is strong evidence for the underlying model. In the case of the action
potential, the theory developed by Hodgkin5 and Huxley6 (1952) provides com-
pelling evidence for the mechanism sketched out in the previous section. The basic
ideas and arguments of this theory are still valid today.
Consider once more the simple circuit presented in Figure 1.1. The total current
across the membrane is separated into the ionic currents through voltage dependent
channels, i.e. sodium (INa) and potassium (IK), a leakage current (Il) lumping all
ions passing through non-voltage dependent channels, and a capacitive current, Ic .

5 Alan Lloyd Hodgkin (1914 – 1998). English physiologist. Nobel Prize in Physiology or Medicine
1963.
6 Andrew Fielding Huxley (1917 – 2012). English physiologist. Nobel Prize in Physiology or

Medicine 1963.

I = Ic + INa + IK + Il    (1.1)

  = CM dE/dt + gNa (E − ENa) + gK (E − EK) + gl (E − El)    (1.2)

For the capacitive current Ic, we have used Coulomb’s law of capacitance: the current I flowing into a capacitor is proportional to the change of voltage E across the capacitor, I = C dE/dt (see Figure 1.11b). The factor C is called capacitance and mea-
sured in F (Farad) = Coulomb/Volt. Intuitively, capacity describes the fact that ions
of opposite polarity attach to the membrane by mutual electric attraction. They thus
form a charge reservoir that has to be recharged if the membrane potential changes.
Recharging takes longer the larger the capacitance is, i.e. the more charges
are attached to the membrane.
For the ion currents, Equation 1.2 uses Ohm’s law (see Figure 1.11a). Instead
of using resistances R, i.e. the ratio of current and voltage measured in Ω (Ohm),
the Hodgkin-Huxley theory is generally formulated using the conductivities g,
which are simply the inverses of resistance, g = 1/R. The unit of conductivity is
S (Siemens) = 1/Ω. ENa , EK and El denote the Nernst potentials listed in Table 1.1.
The ion currents are proportional to the difference between the membrane poten-
tial and the ion’s Nernst potential. If both potentials are equal, the ion type is at
equilibrium.
Instead of the membrane potential E, standard formulations of the Hodgkin-
Huxley theory generally use the depolarization V which is the difference between
the membrane potential and the resting potential,

V = E − Eresting . (1.3)

Depolarization is zero if the membrane is at rest and becomes positive for excited
states. We will use this notation from here on.
In the sequel, we will discuss the excitable sodium and potassium channels
and present phenomenological models for their dynamics. These phenomenologi-
cal models will then be combined using Equation 1.2 to generate a prediction for
the time course of the action potential.

1.2.1 Modeling Conductance Change with Differential Equations


At the core of the Hodgkin-Huxley theory of the action potential lies the fact that the
membrane conductance depends on the voltage or potential across the membrane.
For example, if the membrane is depolarized, the conductances for both potassium
and sodium are larger than at resting potential. This dependence has two parts: (i),
a static dependence g∞ = f (V ) describing the steady state, where the membrane
potential is constant over long time intervals, and (ii), a dynamic part describing
the time course of conductance change in response to changes in potential. Before
we proceed with the potassium channel, we briefly discuss how to model dynamics
with differential equations.

[Figure 1.3: three stacked plots over time; panel a shows the steady state g∞, panels b and c show the dynamic variable g.]

Fig. 1.3 Behavior of a simple differential equation. a. Time course of an outer “driving
force”, expressed as the steady state eventually obtained by the system. It is switched at
discrete times. b. Exponential relaxation to the respective steady states as expressed in
Equation 1.4. The value g reached at the time of switching of the driving force acts as
initial value for the next relaxation. c. Vector field illustrating the differential equation 1.5.
The continuous lines show three solutions for different initial values taken at t = 0.

Let us assume that the membrane potential changes in a step-like fashion at time
t = 1 from a value V1 to V2 and again at time t = 2 from V2 to V3 (Figure 1.3). Such
controlled membrane potentials and membrane potential changes are studied in the
“voltage clamp preparation”, where physiological fluctuations of the membrane po-
tential are compensated for by an electronic circuit. The steady-state conductivities
g∞ (Vi ) are shown in Figure 1.3a. The actual conductivities g(t) will differ from g∞
in that they need some time to change between the steady state levels. A plausible
time course of conductivity showing the non-zero switching times (finite switching
speed) appears in Figure 1.3b. The curve shows an exponential relaxation which
between two stepping times might be formally described by
g(t) = g∞ + (g0 − g∞) exp{−(t − t0)/τ}.    (1.4)
Here, g0 is the initial value of g at time t0, g∞ is the steady state for the voltage applied during the current interval, and τ is a time constant determining the rate of approach towards the steady state. Note that the larger the deviation from the steady state is, the faster the approach (the steeper the curve).
A disadvantage of this equation is the fact that it can deal only with step changes
of membrane potential occurring at discrete points in time, but not with continuous
changes as they occur during the action potential. A formal description that holds for
both step changes and continuous changes is derived from the observation that the
rate of change, i.e. the slope of g(t) should be proportional to the current deviation
from the steady state given by g(t) − g∞:

g′(t) = −k(g(t) − g∞)   for some constant k.    (1.5)

Here we write g∞ for the steady state, keeping in mind that it depends on V. Whatever changes V(t), and therefore g∞, undergoes over time, Equation 1.5 remains valid. An illustration of Equation 1.5 is given in Figure 1.3c. Each short line marks a local slope g′(t) calculated from Equation 1.5. Every curve drawn such that the local lines in Figure 1.3c are tangents to it is a solution of Equation 1.5.
Equation 1.5 is an example of a differential equation, relating the value of a
function to its derivative. The solutions of differential equations are functions (or
families of functions), not numbers. By taking the derivative of g in Equation 1.4
and comparing it to g itself, it is easy to show that this function solves Equation 1.5
if k is set to 1/τ . We start by taking the derivative of Equation 1.4

g′(t) = −(1/τ)(g0 − g∞) exp{−(t − t0)/τ}.    (1.6)

Inserting g and g′ (again from Equation 1.4) into Equation 1.5 we obtain

−(1/τ)(g0 − g∞) exp{−(t − t0)/τ} = −k (g∞ + (g0 − g∞) exp{−(t − t0)/τ} − g∞),

which is satisfied if we set k = 1/τ.
Equation 1.4 is called an analytical solution of the differential equation 1.5 be-
cause it is formulated as a named mathematical function, i.e. the exponential. The
sketched procedure, i.e., guessing a solution and proving that it satisfies the differen-
tial equation if appropriate choices of some variables are made, is a standard way of
finding analytical solutions. In contrast, numerical solutions are sequences of func-
tional values (i.e. numbers) obtained at discrete time-steps. We will see below that
for the full Hodgkin-Huxley system, only numerical solutions exist.
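
To make the distinction concrete, the following Python sketch computes a numerical solution of Equation 1.5 with the simplest scheme, forward-Euler integration, and compares it to the analytical solution of Equation 1.4; the values of k, g0, and g∞ are arbitrary demonstration choices.

```python
import math

k, g_inf, g0 = 2.0, 1.5, 0.5    # arbitrary demo values: rate, steady state, initial value
dt, n_steps = 0.01, 300         # Euler step size and number of steps (t = 0 ... 3)

g = g0
for i in range(1, n_steps + 1):
    g += dt * (-k * (g - g_inf))          # Equation 1.5: g'(t) = -k (g(t) - g_inf)
    if i % 100 == 0:                      # compare at t = 1, 2, 3
        t = i * dt
        exact = g_inf + (g0 - g_inf) * math.exp(-k * t)   # Equation 1.4 with tau = 1/k
        print(f"t = {t:.0f}: Euler {g:.4f}, analytical {exact:.4f}")
```
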

1.2.2 The Potassium Channel


Individual channels can be studied in voltage clamp preparations using axons in
solutions with different concentrations of the sodium and potassium ions. We will
not discuss the methods for distinguishing the ions here (see for example Aidley,
1998). Figure 1.4 shows the result of such a measurement where the membrane
potential is stepped from resting potential to a depolarization of 109 mV at time

Fig. 1.4 Potassium conductance change (gK) in the voltage clamp experiment, in
milli-Siemens per square centimeter. Membrane potential is stepped from resting potential
by 109 mV (depolarization) at time t = 0 ms and stepped back to resting potential at time
t = 10 ms. [Plot not reproduced; axes: K+ conductance (mS/cm²) versus time (milliseconds).]

[Figure 1.5: reaction scheme. Left: inactivated subunits (concentration 1 − n); middle: activated subunits (concentration n), reached with forward rate αn(V) and left with backward rate βn(V); right: the assembled channel (concentration n⁴).]

Fig. 1.5 Kinetics modeled by Equations 1.7 and 1.8 (potassium channel). Left: pool of inactivated
subunits. Middle: activated subunits; the portion of subunits in the activated state is n. Right:
channel formed of four subunits. The probability of four subunits joining in one place is n⁴.

t = 0 ms. Conductance rises to a plateau, with the highest speed of change occur-
ring not at the beginning (as would be the case for the simple relaxation equation
studied above), but at some intermediate value. This behavior implies that the solu-
tion should have an “inflection point”, i.e. the point of highest speed after the onset
of the “relaxation”. This inflection point will be important for the modeling of the
channel switching behavior. At time t = 10 ms, the membrane is repolarized and the
conductance relaxes to zero, this time without an inflection point. If other depo-
larizations are chosen, the shape of the curve changes slightly. This is to say that the
response is non-linear.
Hodgkin and Huxley (1952) suggested modeling this behavior by a two-step pro-
cess including (i) a simple relaxation process described by an auxiliary variable n and
(ii) a fourth order interaction term needed to model the inflection point found in the
voltage clamp response:

gK(t) = ḡK n(t)⁴    (1.7)

dn/dt(t) = αn (1 − n(t)) − βn n(t).    (1.8)
Here ḡK is a constant (about 22 milli-Siemens per centimeter squared) depend-
ing neither on time nor on voltage. It represents the maximum conductivity if all
potassium channels are open, i.e. the number of channels present in the membrane.

Equation 1.7 means that four channel subunits have to meet in order to form a chan-
nel. The probability of finding four subunits in one spot is equal to the probability
of finding one, raised to the fourth power (as long as the subunits move indepen-
dently in the membrane). Equation 1.8 assumes that the channels can be in one of
two states, one favorable of forming a channel and one unfavorable. The dimen-
sionless number n is the fraction of channel subunits being in the favorable state;
it varies between 0 and 1. αn and βn are called “rate constants” which depend on
voltage, but not on time. They specify the likelihood (the rate) at which the channel
subunit switches from the unfavorable state into the favorable state (αn (V )), or back
(βn (V )). These rate constants thus incorporate the voltage dependence of channel
formation. αn (V ) can be expected to grow with V whereas βn (V ) should decrease.
The equation is illustrated as a chemical reaction kinetic in Figure 1.5.
The exponent four in Equation 1.7 has been chosen to reproduce a subtle feature
of the conductivity step response shown in Figure 1.4. While n(t) will show simple
exponential relaxation, n(t)⁴ nicely fits the inflection point occurring in the rising
branch but not in the falling one. It is remarkable that molecular biology has since
confirmed the involvement of just four subunits in the formation of the potassium
channel, as predicted from the voltage clamp measurements by Hodgkin & Huxley
(1952).
Observe that Equation 1.8 has the same structure as Equation 1.5 studied above,
i.e., n′ = −k(n − n∞), if we rearrange the terms and replace k by αn + βn and n∞ by
αn/(αn + βn):

dn/dt(t) = −(αn + βn) (n(t) − αn/(αn + βn)).    (1.9)

This implies that the values of αn(V) and βn(V) can be measured by performing voltage clamp experiments with different voltage steps. For each voltage step (V) applied, we take the fourth root to go from g/ḡ to n; the rate constant of the resulting exponential relaxation then corresponds to αn(V) + βn(V) (the time constant is its inverse), while the steady-state value corresponds to αn(V)/(αn(V) + βn(V)). From these two measurements, αn
and βn themselves can be calculated. As expected, αn increases with higher voltage
(i.e. more channel subunits change into the activated state) while βn decreases with
higher voltage (i.e. more channel subunits change to inactivated state). The detailed
dependence of the rate constants on voltage can be found in Hodgkin & Huxley
(1952).
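
The whole scheme can be illustrated with a short Python sketch that integrates Equation 1.8 for a voltage step and converts n to a conductance via Equation 1.7. The rate-constant formulas are the empirical fits published by Hodgkin & Huxley (1952), taken here in the common modern sign convention, and are an assumption insofar as the text does not reproduce them; the resulting gK(t) rises with an inflection point and falls without one, as in Figure 1.4.

```python
import math

# Voltage dependent rate constants of the n variable, in 1/ms, for a
# depolarization V in mV; empirical fits from Hodgkin & Huxley (1952),
# assumed here in the usual modern sign convention.
def alpha_n(V):
    return 0.01 * (10.0 - V) / (math.exp((10.0 - V) / 10.0) - 1.0)

def beta_n(V):
    return 0.125 * math.exp(-V / 80.0)

g_K_bar = 22.0    # maximum potassium conductance, mS/cm^2 (value quoted in the text)
dt = 0.01         # Euler time step, ms

n = alpha_n(0.0) / (alpha_n(0.0) + beta_n(0.0))   # steady state at rest (V = 0)
for i in range(int(20.0 / dt)):
    t = i * dt
    V = 109.0 if t < 10.0 else 0.0                # step on at t = 0, off at t = 10 ms
    n += dt * (alpha_n(V) * (1.0 - n) - beta_n(V) * n)     # Equation 1.8
    if i % 200 == 0:                              # report every 2 ms
        print(f"t = {t:5.2f} ms   g_K = {g_K_bar * n**4:6.2f} mS/cm^2")  # Equation 1.7
```
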

1.2.3 The Sodium Channel


The response of the sodium conductance to a step in membrane potential of 109 mV
is shown in Figure 1.6 (heavy line). It differs from the potassium response in its
quicker rise and in its overall phasic behavior, i.e. the conductance goes back to
zero after a few milliseconds of depolarization. If the membrane is repolarized at a
time when conductance is still high, the drop in conductance is very steep (thin line
in Figure 1.6).

Fig. 1.6 Sodium conductance change (gNa) in the voltage clamp experiment, in
milli-Siemens per square centimeter. Membrane potential is stepped from resting potential
by 109 mV (depolarization) at time t = 0 ms. The thin line shows the conductance if
the membrane is repolarized at time t = 0.39 ms. [Plot not reproduced; axes: Na+
conductance (mS/cm²) versus time (milliseconds).]

[Figure 1.7: reaction scheme with two independent gating processes. Activation particles switch between concentrations 1 − m and m with rates αm(V) and βm(V); inhibition particles switch between concentrations 1 − h and h with rates αh(V) and βh(V). The open channel corresponds to the product m³h.]

Fig. 1.7 Kinetics modeled by Equations 1.10 – 1.12 (sodium channel). Circles: inactivated “parti-
cles”; boxes: activated particles. Filled symbols: activation particles; open symbols: inhibition
particles.

The phenomenological model presented for this dynamics by Hodgkin & Huxley
(1952) uses two auxiliary variables, one (m) modeling the rise (activation gate) and
one (h) modeling the drop (inactivation gate) of conductance:

gNa(t) = ḡNa m(t)³ h(t)    (1.10)

dm/dt(t) = αm (1 − m(t)) − βm m(t)    (1.11)

dh/dt(t) = αh (1 − h(t)) − βh h(t).    (1.12)
As before, the rate constants αm , βm , αh , and βh depend on depolarization, but not on
time. They can be obtained from voltage clamp measurements, as explained above
for the potassium channel.
Figure 1.7 shows a state diagram of the Hodgkin-Huxley kinetics of the sodium
channel. The match with the actual molecular mechanisms is not as close as for the
potassium channel. Indeed, more complex models of sodium channel kinetics have
been suggested. For a recent discussion, see Milescu et al. (2008).

1.2.4 Combining the Conductances in Space Clamp


So far, we have considered how the channel conductivity depends on membrane
potential and used the voltage clamp preparation to formalize this relation. Chan-
nel conductivity of course affects ion currents across the membrane which in turn
will change membrane potential. The whole process is thus a feedback loop of
membrane potential, membrane conductivity, membrane current and again mem-
brane potential, which we have artificially disconnected by means of the voltage
clamp preparation. In this section, we will now proceed to consider the closed loop
situation.
The first step to this end is to look at a complete action potential in the so-
called space-clamp situation where the potential is kept constant over a small patch
of membrane but may vary over time. This can be achieved by inserting a highly

[Figure 1.8: two stacked plots over time (0 – 10 milliseconds). Top: depolarization V(t) in mV for stimuli of 30 mV, 7 mV, and 6 mV; the two larger stimuli produce full action potentials of about 100 mV, while the 6 mV trace decays. Bottom: the conductances gNa and gK in mS/cm² for the 7 mV stimulus.]

Fig. 1.8 Solutions of the space-clamp equations for external stimuli V0 occurring at time
t = 0. Top: three solutions for the depolarization resulting from external stimuli V0 = 6 mV,
7 mV, and 30 mV above resting potential. The threshold for spike generation is between 6
and 7 mV. Spikes generated by small and large super-threshold stimuli look the same (“all or
nothing” rule of spike generation). Bottom: conductance changes resulting from a stimulation
with 7 mV.
conducting copper wire into an axon which will equilibrate all differences in mem-
brane potential occurring along the axon. The whole patch will thus exhibit the
action potential in synchrony.
In the space-clamp situation, no current will flow in axial direction since poten-
tials are constant over space. Therefore, there can also be no current loops involving
an axial component, which in turn excludes net currents across the membrane. We
may thus assume I = 0 in Equation 1.2 and obtain the following set of four coupled
ordinary differential equations:

−CM dV/dt(t) = ḡK n(t)⁴ (V(t) − VK) + ḡNa m(t)³ h(t) (V(t) − VNa) + ḡl (V(t) − Vl)    (1.13)

dn/dt(t) = αn(V(t)) (1 − n(t)) − βn(V(t)) n(t)    (1.14)

dm/dt(t) = αm(V(t)) (1 − m(t)) − βm(V(t)) m(t)    (1.15)

dh/dt(t) = αh(V(t)) (1 − h(t)) − βh(V(t)) h(t)    (1.16)
The coupled system of non-linear differential equations 1.13 – 1.16 does not have
analytical solutions, i.e. we cannot derive a closed formula for the time-course of
the action potential. However, for the overall argument, it suffices to compute nu-
merical solutions, i.e. sampled functions approximating the solution, which can
then be compared to the measurements. Numerical solutions are obtained by first
specifying “initial values” for each of the four variables V, n, m, and h. For V , the
initial value is simply the external stimulus. For the auxiliary variable n, we observe that dn/dt should be zero if the membrane is at rest. Therefore, we obtain from Equation 1.14: nrest = αn(0)/(αn(0) + βn(0)), where the value 0 in the arguments refers to zero depolarization. Analogous results are obtained for m and h. The initial values are
inserted in the right side of the equations. Each equation then gives a value for the
slope of the respective state variable at time 0. With this slope and the initial value,
the value at a small time step Δt is estimated by linear extrapolation. The resulting
new estimates are again inserted in the equations, leading to new slope values and in turn
to new estimates of the state variables at time 2Δt. This procedure is iterated until
the time interval considered is exhausted. A more elaborate version of this scheme,
known as Runge-Kutta integration, was used to produce the plots in this text.
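
A minimal Python implementation of this Euler scheme for the full system 1.13 – 1.16 is sketched below. The membrane parameters and rate-constant formulas are the classic squid-axon fits of Hodgkin & Huxley (1952), which the text does not list explicitly, so the exact numbers are assumptions; with them, the all-or-nothing threshold between 6 and 7 mV discussed next can be reproduced.

```python
import math

# Classic squid-axon parameters from Hodgkin & Huxley (1952); the text does
# not list them explicitly, so the exact numbers are assumptions here.
C_M = 1.0                                     # membrane capacitance, uF/cm^2
gbar_Na, gbar_K, gbar_l = 120.0, 36.0, 0.3    # maximum conductances, mS/cm^2
V_Na, V_K, V_l = 115.0, -12.0, 10.613         # reversal potentials rel. to rest, mV

# empirical rate constants in 1/ms, with V the depolarization in mV
def a_n(V): return 0.01 * (10 - V) / (math.exp((10 - V) / 10) - 1)
def b_n(V): return 0.125 * math.exp(-V / 80)
def a_m(V): return 0.1 * (25 - V) / (math.exp((25 - V) / 10) - 1)
def b_m(V): return 4.0 * math.exp(-V / 18)
def a_h(V): return 0.07 * math.exp(-V / 20)
def b_h(V): return 1.0 / (math.exp((30 - V) / 10) + 1)

def action_potential(V0, dt=0.01, t_end=10.0):
    """Euler integration of Equations 1.13 - 1.16; returns the peak of V(t)."""
    V = V0                                    # initial value: the external stimulus
    n = a_n(0) / (a_n(0) + b_n(0))            # gating variables start from their
    m = a_m(0) / (a_m(0) + b_m(0))            # resting steady states (dx/dt = 0)
    h = a_h(0) / (a_h(0) + b_h(0))
    peak = V
    for _ in range(int(t_end / dt)):
        I_ion = (gbar_K * n**4 * (V - V_K)    # right-hand side of Eq. 1.13
                 + gbar_Na * m**3 * h * (V - V_Na)
                 + gbar_l * (V - V_l))
        V += dt * (-I_ion / C_M)
        n += dt * (a_n(V) * (1 - n) - b_n(V) * n)   # Eq. 1.14
        m += dt * (a_m(V) * (1 - m) - b_m(V) * m)   # Eq. 1.15
        h += dt * (a_h(V) * (1 - h) - b_h(V) * h)   # Eq. 1.16
        peak = max(peak, V)
    return peak

for V0 in (6.0, 7.0, 30.0):    # the three stimuli of Figure 1.8
    print(f"V(0) = {V0:4.1f} mV  ->  peak depolarization {action_potential(V0):6.1f} mV")
```
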
Figure 1.8 shows a numerical solution of the system 1.13 – 1.16 for an external
stimulus of V (0) = 7 mV. The depolarization V is plotted in the upper part. The
shape of the action potential is in very good agreement with the shape measured in
space-clamp experiments using squid axons. The corresponding time course of the
conductances is shown in the lower part of Figure 1.8.
If different external stimuli V (0) are chosen, two different types of behavior are
observed (Figure 1.8). If V (0) is 6 mV or less, the effect is very small; the stimulus
is sub-threshold. For stronger stimuli, an action potential is generated which looks
more or less the same for all super-threshold stimuli. Stronger stimuli result in earlier

[Figure 1.9: two plots of v(t) over 10 milliseconds, both with w(0) = 0.0. Left: v(0) = 0.8, showing a full excursion; right: v(0) = 0.2, decaying directly.]

Fig. 1.9 Solutions of the FitzHugh-Nagumo system with Ia = 0, a = 0.3, b = 0.8, γ = 1.0.
Left: start at v = 0.8; right: start at v = 0.2.

action potentials, not in bigger ones. The Hodgkin-Huxley equations thus capture
the all-or-nothing behavior of neural excitation.

1.3 An Analytical Approximation: The FitzHugh-Nagumo Equations7

In the Hodgkin-Huxley equations, the sodium conductance gNa has a much shorter
time constant than the variables governing gK . In an approximation, one can assume
that dm/dt = 0. Further, the slow time constant of h might be approximated by
assuming h(t) ≡ ho , a constant. With these simplifications, we obtain from Equa-
tions 1.13 – 1.16 a simplified (two-dimensional) system whose qualitative proper-
ties can be studied in the so-called FitzHugh-Nagumo equations (cf. FitzHugh 1961,
Murray 2002):
dv/dt(t) = f(v(t)) − w(t) + Ia    (1.17)

dw/dt(t) = b v(t) − γ w(t)    (1.18)

f(v) := v (a − v) (v − 1)    (1.19)

Ia is an externally applied current and a, b, γ are constants.
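
The system is easily integrated numerically; the short Python sketch below uses the same forward-Euler scheme as before, with the parameter values taken from the caption of Figure 1.9.

```python
def fn_trajectory(v0, w0=0.0, Ia=0.0, a=0.3, b=0.8, gamma=1.0,
                  dt=0.001, t_end=10.0):
    """Forward-Euler integration of Equations 1.17 - 1.19."""
    v, w = v0, w0
    for _ in range(int(t_end / dt)):
        f = v * (a - v) * (v - 1.0)                         # Equation 1.19
        v, w = v + dt * (f - w + Ia), w + dt * (b * v - gamma * w)
    return v, w

# the two initial conditions of Figure 1.9: above and below the threshold a
for v0 in (0.8, 0.2):
    v, w = fn_trajectory(v0)
    print(f"v(0) = {v0}: at t = 10, v = {v:+.4f}, w = {w:+.4f}")
```
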


The FitzHugh-Nagumo equations are much simpler than the Hodgkin-Huxley
equations in that they employ only two (rather than four) coupled variables; in fact,
there exist analytical solutions. Examples of numerical solutions are shown in Fig-
ure 1.9. We will now discuss some of the qualitative properties of the dynamics of
neural excitation, which can be studied best with the FitzHugh-Nagumo equations.
Figure 1.10a shows the so-called phase portrait of the FitzHugh-Nagumo system.
Here, the variables v and w are plotted along the axes (note that this is intuitive only
for two-dimensional systems). Time is not explicitly represented in this plot. The
7 This section may be skipped on first reading.

[Figure 1.10: two phase-plane plots with v on the abscissa and w on the ordinate, each showing the null isoclines dv/dt = 0 and dw/dt = 0 and sample trajectories.]

Fig. 1.10 Phase portraits of the FitzHugh-Nagumo system. a. Non-oscillating case; param-
eters as in Figure 1.9. The thin lines are the “null isoclines” dw/dt = 0 and dv/dt = 0. The
arrows give the initial direction of solutions starting from the respective points. The heavy
lines show the two solutions from Figure 1.9. They converge to the intersection of the iso-
clines, which forms a stable fixed point. b. Oscillator; parameters a = 0.2, b = 0.2, γ = 0.2,
Ia = 0.352. Note that the v-isocline has moved upwards. The intersection of the isoclines
is no longer a stable point. Two solutions are shown approaching a stable limit cycle from
within and from outside.

idea is that each point in the plane represents a state (or phase) of the system and
the dynamics is represented by movements in that plane. For example, the solutions
shown in Figure 1.9 are represented in Figure 1.10a as trajectories, (v(t), w(t)).
We can now look for states (i.e. pairs of numbers (v, w)) for which the system of
equations 1.17 – 1.19 stops moving. Such points in the phase plane are called fixed
points or steady states. From Equations 1.17 and 1.18 we obtain:

dv/dt = 0 ⇐⇒ w = f(v) + Ia    (1.20)

dw/dt = 0 ⇐⇒ w = (b/γ) v    (1.21)

These curves are called null-isoclines and are also plotted in Figure 1.10a. They
intersect at the only fixed point (0, 0). This fixed point is stable in the sense that
all arrows in a vicinity of it have a component in the direction of the fixed point.
Therefore, the system will approach the fixed point and rest there.
If we add an applied current Ia , the curve dv/dt = 0 is shifted upwards. Depend-
ing on the slope of both curves (dv/dt = 0 and dw/dt = 0), there can be either one
or three intersection points, i.e. points where both derivatives vanish. Figure 1.10b
shows a case where only one such intersection point exists. The parameters are
chosen such that it is the inflection point of the curve dv/dt = 0. For the parame-
ters chosen, the fixed point is not stable. Trajectories starting in a neighborhood of
the fixed point move away from it and enter a limit cycle orbiting the fixed point.
Trajectories starting further out approach the limit cycle from outside. This result
shows that the FitzHugh-Nagumo system can produce oscillating responses when
activated by an applied current. The transition from a stable fixed point to an unsta-
ble fixed point with limit cycle, effected by the change of a parameter, is called a
Hopf-bifurcation.
The qualitative behavior described here has been shown to occur for the Hodgkin-
Huxley system as well, using numerical simulations. Oscillatory responses of ex-
citable cells are known for example from the pacemaker cells in the sinoatrial and
other nodes of the heart; here, the constant incoming current Ia is realized by strong
leakage channels carrying a constant sodium current.

1.4 Passive Conduction


In the space-clamp, we have assumed that the potentials and currents are constant
along a patch of membrane. In reality, a depolarization at a point x of an axon will
spread laterally by means of an axial current ia , leading to a depolarization of ad-
jacent membrane patches. This process is called passive conduction since it does
not involve active (voltage dependent) channels; it occurs in dendritic, somatic and
axonal parts of the neuron alike. In dendrites, it is usually the only type of con-
duction, since voltage dependent channels are generally missing. In axons, passive
conduction of depolarization will initiate action potentials in neighboring membrane
sections, causing a propagation of the action potential along the fiber. Due to the re-
fractory period of the membrane, the propagation will be directed. In order to derive
the equations governing this spread and propagation, we need to remember some
basic physics as is summarized in Figure 1.11.
Some of the quantities that we will have to consider differ from the situation in
simple circuits in that they are continuous, or distributed in space. For example, the
axial, or longitudinal, resistance within an axon or dendrite is larger the longer
the considered part of the fiber is. Therefore, we have to consider a resistance
per centimeter. The resistance across the membrane will vary inversely with the
considered membrane area or fiber length. Since we consider only cylindrical fibers,
we have to use a quantity with the dimension Ohms times centimeter. The current
passing the membrane will be proportional to membrane area or fiber length. Finally,

membrane potential and axial currents are independent of fiber length. A summary
of the quantities involved and their dimensions is given in Table 1.2.
The basic relations can be inferred from inspection of Figure 1.12. We start by
noting that the membrane potential will be a function of axon length and time, as
are the specific currents involved. We therefore denote them as V (x,t), ia (x,t), and
im (x,t), respectively. In Figure 1.12, the length variable is discretized, but we will
use a continuous representation involving spatial and temporal derivatives. Follow-
ing Ohm's law (Figure 1.11a), the axial current ia(x,t) is proportional to the potential
change in axial direction, i.e.

ia(x,t) = −(1/ra) ∂V/∂x (x,t), (1.22)
where ra is the axial resistance measured in Ω/cm. The curly d (∂ ) denotes partial
derivatives which are needed here since V depends on space and time, whereas the
derivative is taken only with respect to space. We will not elaborate on partial deriva-
tives here; for a mathematical definition see Figure 4.8a and Equations 4.20, 4.21.
The axial current will decrease for larger x since part of it will leak across the
membrane, thus forming a membrane current im . From Kirchhoff’s node rule (Fig-
ure 1.11c), we have:
−im(x,t) = ∂ia/∂x (x,t). (1.23)
Substituting for ia from Equation 1.22, we obtain

im(x,t) = (1/ra) ∂²V/∂x² (x,t). (1.24)
The membrane current im is composed of two components, a resistive, or leak
current (following Ohm’s law, Figure 1.11a) and a capacitive current (following
Coulomb’s law, Figure 1.11b). We can write
Table 1.2 Quantities involved in the cable equation. Only cylindrical membrane tubes are
considered and all lengths are axial. See also Figure 1.12.

Symbol   Interpretation                                        Dimension
V        membrane potential                                    V (Volt)
ia       current flow along the fiber axis                     A (Ampere)
im       current flow across the membrane per fiber length     A / cm
ra       axial resistance per fiber length                     Ω (Ohm) / cm
rm       membrane resistance times fiber length                Ω cm
         (1/membrane conductance per fiber length)
cm       membrane capacitance per fiber length                 F (Farad) / cm
Fig. 1.11 Rules for electric networks. a. Ohm's law: The current I through a resistor is
proportional to the voltage V across the resistor (V = R I). The ratio is called resistance,
R, measured in Ω (Ohm) = V (Volt) / A (Ampere). b. Coulomb's law: the change of voltage
is proportional to the current flowing into the capacitor (C dV/dt = I). The ratio C is the
capacitance, measured in F (Farad) = C (Coulomb) / V (Volt). c. Kirchhoff's node rule: The
sum of all currents flowing into one node is zero (∑k Ik = 0). If outbound currents are
considered, the signs have to be changed accordingly. d. Kirchhoff's mesh rule: the sum of
all voltages in a mesh adds to zero (∑k Vk = 0), if all voltages are measured either in a
clockwise or in a counter-clockwise sense.

im(x,t) = (1/rm) V(x,t) + cm ∂V/∂t (x,t). (1.25)
We can now equate the above two expressions for im and obtain

V(x,t) = (rm/ra) ∂²V/∂x² (x,t) − rm cm ∂V/∂t (x,t). (1.26)
This is a partial differential equation known as the cable equation. If we assume
that the potential is clamped to Vo at location x = 0, and consider the steady-state
solution (i.e. ∂ V /∂ t = 0), we obtain the simpler, ordinary differential equation8

8 “Ordinary” is here the opposite of “partial”. An ordinary differential equation contains only
ordinary derivatives, as opposed to partial ones. The differential equations studied so far were all
of the ordinary type.
V(x) = (rm/ra) d²V/dx² (x) (1.27)

which has the solution

V(x) = Vo exp(−x / √(rm/ra)). (1.28)
Fig. 1.12 Electrical circuit for modeling the passive propagation of activity along an axon.
ia axial current, im membrane current, ra axial resistance, rm membrane resistance, cm mem-
brane capacitance.

Equation 1.28 describes an exponential decline of the local potential Vo with a "length
constant" √(rm/ra). In typical axons, it takes values in the order of a millimeter. Since
rm varies inversely with the circumference of an axon while ra varies inversely with its
cross-sectional area, thicker axons have larger length constants. The length constant
determines the range of passive conduction. For example, the distance between the nodes
of Ranvier may increase with the length constant. A dendritic tree whose size is well
below the length constant may be treated as equipotential in dendritic summation. The
influence of passive conduction in dendrites on the electric properties of the neuron
has been studied by Rall (1962).
In dynamic cases, e.g., in the propagation of action potentials on axons or passive
conduction of postsynaptic potentials on dendrites, the capacitor will also play an
important role. The speed of recharging the capacitor depends on the time constant
τ = rm cm also appearing in Equation 1.26. It is largely independent of cable diam-
eter but can be reduced by reducing the membrane capacitance. Myelin sheaths, besides
increasing membrane resistance, also have the effect of reducing membrane capacitance,
since they increase the distance between charges in the intracellular and extracellular
space. Therefore, passive conduction at the internodes (i.e. between Ranvier nodes) is
faster than in non-myelinated axons of equal thickness.
The spatio-temporal cable equation (1.26) is hard to solve analytically; for a dis-
cussion of the issue see Jack et al. (1975) and Tuckwell (1988). The best-studied
case concerns a constant membrane current delivered from time t = 0 at location
x = 0. In this case, the membrane potential will approach the steady state described
by Equation 1.28. As it turns out, the approach to this steady state becomes slower the
further away a given membrane patch is from the stimulation site. Overall, however,
the speed is governed by the time constant τ = rm cm.
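A numerical illustration: rewriting Equation 1.26 as τ ∂V/∂t = λ² ∂²V/∂x² − V with λ² = rm/ra and τ = rm cm, a simple finite-difference scheme (a rough sketch with made-up parameter values, not taken from the original literature) shows the relaxation toward the steady state of Equation 1.28.

    # Rough finite-difference sketch of the passive cable, Equation 1.26,
    # written as  tau * dV/dt = lam^2 * d2V/dx2 - V.  Values are illustrative.
    import numpy as np

    lam, tau = 1.0, 10.0        # length constant [mm], time constant [ms]
    dx, dt = 0.05, 0.01         # dt < dx^2 * tau / (2 * lam^2) for stability
    n = 200                     # cable length n*dx = 10 mm
    V = np.zeros(n)

    for step in range(200_000): # integrate until (near) steady state
        V[0] = 1.0              # potential clamped to Vo = 1 at x = 0
        d2V = (np.roll(V, -1) - 2 * V + np.roll(V, 1)) / dx**2
        d2V[-1] = 0.0           # crude boundary condition at the far end
        V = V + dt / tau * (lam**2 * d2V - V)

    x = np.arange(n) * dx       # approximately matches Eq. 1.28:
    print(np.allclose(V[:100], np.exp(-x[:100] / lam), atol=0.02))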
1.5 Propagating Action Potentials
With the cable equation, we can now formulate the Hodgkin-Huxley Equation 1.13
in the space-variant case, i.e. for propagating action potentials. For active mem-
branes, the leak current V/rm in Equation 1.25 has to be replaced by the membrane
current passing through the channels, i.e. ḡK n(t)⁴ (V − VK) for the potassium channel
plus the respective terms for the sodium channel and leakage. We thus obtain

(1/ra) ∂²V/∂x² (x,t) = cm ∂V/∂t (x,t) + ḡK n(t)⁴ (V(t) − VK)
                     + ḡNa m(t)³ h(t) (V(t) − VNa) + ḡl (V(t) − Vl). (1.29)

Equations 1.14 – 1.16 remain unchanged, except for the fact that the variables V , n,
m, and h now depend additionally on spatial position x. Equation 1.29 is a partial
differential equation which would be very hard to solve even numerically. One simplification,
used by Hodgkin and Huxley (1952), exploits the fact that action potentials
do not change their shape as they travel along an axon. That is, if V(x,t) is an action
potential traveling with velocity θ, its shifted version after time Δt and distance θΔt
must look the same: V(x,t) = V(x − θΔt, t − Δt). Thus, V(x,t) can be written as a
one-dimensional function f(x − θt). Taking the second derivatives of f with respect
to x and t, it is easy to see that the wave equation

∂²V/∂x² = (1/θ²) ∂²V/∂t² (1.30)
must hold. We can therefore reduce Equation 1.29 to the ordinary differential
equation

(1/(ra θ²)) d²V/dt² (t) = cm dV/dt (t) + ḡK n(t)⁴ (V(t) − VK)
                        + ḡNa m(t)³ h(t) (V(t) − VNa) + ḡl (V(t) − Vl). (1.31)

Numerical solutions of Equation 1.31 can be obtained if the parameter θ is selected
in a suitable way. The solutions again reproduce the empirical shape of the action
potential.
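The travelling-wave substitution used above can also be checked symbolically; the following short sketch (Python, assuming the sympy library is available) verifies that any function of the form f(x − θt) satisfies the wave equation 1.30.

    # Symbolic check (illustrative) that V(x,t) = f(x - theta*t)
    # satisfies the wave equation (1.30).
    import sympy as sp

    x, t, theta = sp.symbols('x t theta')
    f = sp.Function('f')
    V = f(x - theta * t)

    lhs = sp.diff(V, x, 2)              # d2V/dx2
    rhs = sp.diff(V, t, 2) / theta**2   # (1/theta^2) d2V/dt2
    print(sp.simplify(lhs - rhs))       # 0: Eq. (1.30) holds identically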

1.6 Summary and Outlook
In this chapter, we have studied the formation and propagation of action potentials
in four steps:
1. Voltage clamp experiments: The switching dynamics of voltage dependent ion
channels has been studied in an open loop preparation (voltage clamp), where
current flowing through the channels does not affect membrane potential. This
resulted in the kinetic models of the potassium and sodium channels.
2. Closing the loop in the space-invariant case: In the "space clamp" preparation,
the loop is closed again but the potentials do not propagate in space. The theory of
this case shows that the standard theory of electric circuits and the models of channel
kinetics developed before suffice to explain the time course and "all-or-nothing"
property of the action potential.
3. Space dependence is first analyzed for passive membranes (i.e. membranes with-
out voltage dependent channels), resulting in the cable equation (core-conductor
theory). The cable equation explains the interaction of axial and cross membrane
currents.
4. Propagation of action potentials is modeled by combining the cable equation with
the kinetic channel models derived from the voltage clamp experiments.
The work sketched out in this chapter is the basis of a most active field of neuroscience.
In different parts of the nervous system, as well as in muscle fibers and sensory systems,
a large number of different ion channels have been identified whose kinetics
and switching behavior are still modeled along the lines pioneered by Hodgkin
and Huxley. In the patch clamp preparation, the electrical properties of tiny membrane
patches containing individual channels can be studied, thus combining in a single
preparation the much more macroscopic voltage- and space-clamp techniques discussed
above. Electrical properties of large neurons with bifurcating dendrites, soma, and
bifurcating axons are studied in compartmental modeling, in which the neuron's
anatomy is broken down into a tree of cylindrical compartments, each of which follows
the electrical theory presented here (Rall 1962, Hines & Carnevale 1997).

1.7 Suggested Reading
Books
Aidley, D. J. (1998). The Physiology of Excitable Cells. 4th edition, Cambridge
University Press, Cambridge, UK.
De Schutter, E. (ed.) (2010). Computational Modeling Methods for Neuroscientists.
The MIT Press, Cambridge, MA.
Gerstner, W. and Kistler, W. (2002). Spiking Neuron Models. Single Neurons, Pop-
ulations, Plasticity, Cambridge University Press, Cambridge, UK.
Jack, J. J. B., Noble, D., and Tsien, R. W. (1975). Electric Current Flow in Excitable
Cells. Clarendon Press, Oxford.
Murray, J. D. (2002). Mathematical Biology – An Introduction. 3rd edition, Springer
Verlag, Berlin. (Section 7.5)
Koch, C., and Segev, I. (1998). Methods in Neuronal Modeling. From Ions to Net-
works. 2nd edition, The MIT Press, Cambridge, MA.
Tuckwell, H. (1988). Introduction to Theoretical Neurobiology (2 Vols.). Cambridge
University Press, Cambridge.

Original Papers
Hines, M. L. and Carnevale, N. T. (1997). The NEURON simulation environment.
Neural Computation 9, 1179 – 1209.
Important review paper on compartmental modeling, introducing both the basic
concepts and the simulation tools.
Hodgkin, A. L. and Huxley, A. F. (1952). A quantitative description of membrane
current and its application to conduction and excitation in nerve. Journal of
Physiology (London), 117:500 – 544.
Classical account of the mechanism and theory of the action potential, which by
and large describes the state of the art to this day. The present chapter closely
follows this masterpiece of computational neuroscience.
Milescu, L. S., Yamanishi, T., Ptak, K., Mogri, M. Z., and Smith, J. C. (2008).
Real time kinetic modeling of voltage-gated ion channels using dynamic clamp.
Biophysical Journal 95:66 – 87.
This paper combines the “dynamic clamp” (replacement of specific channels
by real-time computed currents) and Markovian modeling of channel switching
dynamics.
Naundorf, B., Wolf, F., and Volgushev, M. (2006). Unique features of action potential
initiation in cortical neurons. Nature 440:1060 – 1063.
Action potentials in cortical neurons are shown to rise even more steeply than
predicted by the Hodgkin-Huxley theory (initially developed for the squid). The
authors suggest an additional interaction between adjacent sodium channels not
considered in the original theory.
Rall, W. (1962). Electrophysiology of a dendritic neuron model. Biophysical Jour-
nal, 2:145 – 167.
Early and influential paper on the electrodynamics of (passive) conduction and
their dependence on dendritic geometry. Essential for deeper understanding of
dendritic summation and compartmental modeling.
Wilders, R. (2007). Computer modelling of the sinoatrial node. Medical & Biological
Engineering & Computing, 45:189 – 207.
Review of models of neural activity in pacemaker neurons of the heart. Rhyth-
mic activity can occur in Hodgkin-Huxley systems with high leak currents. The
paper summarizes and compares specific models of the sinoatrial pacemaker
oscillations.
Chapter 2
Receptive Fields and the Specificity of Neuronal
Firing

Abstract. Neuronal firing does not occur at random. In the sensory parts of the
brain, firing is triggered by properties of various input stimuli, such as the position
of a light stimulus in the visual field, the pitch of a tone, or the appearance of a famil-
iar face. In “associative” areas of the brain, specificities for more abstract concepts
have been found including cells representing place (e.g., in rodent hippocampus),
or numerosity (e.g., in primate prefrontal cortex). In the motor parts of the brain,
neurons have been found that fire preferably prior to pointing movements of the
arm into a certain direction. That is to say, these neurons are specific for particu-
lar motor actions. In the sensory domain, specificities are quantified in terms of the
receptive field, which can be defined as the totality of all stimuli driving a given
neuron. The receptive field is measured by correlating the activity of a single neu-
ron with externally measurable parameters of the stimulus. This approach is known
as reverse correlation, since stimuli will always precede the neuronal activity. The
concept of correlation between neuronal activity and external measurables, however,
generalizes easily to the motor system, leading to the concept of the motor field of
a neuron. In this sense, visual receptive fields can be considered as an example for
neuronal specificity at large. In this chapter, we discuss the basic theory of visual
receptive fields which can be extended to similar concepts in other sensory, motor,
or associative areas. The theory is closely related to linear systems theory applied
to spatio-temporal signals, i.e. image sequences. Mathematically, it rests on integral
equations of the convolution type which will be introduced in due course.

2.1 Spatial Summation
2.1.1 Correlation and Linear Spatial Summation
The Superposition Principle
If an electrode is placed in the retina, spikes can be recorded from retinal ganglion
cells. Clearly, the spiking activity of these cells will depend on the stimulus, i.e. on
the pattern of light shone on the receptor layer of the retina while recording. In the
simplest case, a little spot of light is moved across the retina covering just a few
receptors at any one time. Usually, rather than shining the light directly onto the
receptor cells, one will have the animal watch a projection screen onto which the
stimulus is presented. The receptive field is then defined as the portion of the visual
field, from which the firing activity can be modulated by means of the visual stim-
ulus (see Figure 2.1). Mathematically, this is described by a receptive field function
φ (x, y) which for each position of the stimulating spot of light (x,y) specifies the
elicited response of the neuron.
What happens if we use larger spots of light, or, more interestingly, patterned
stimuli? If we just use two spots of light, the simplest idea is that the activities
that are elicited by each of the lights in isolation, are added together. This scheme
is called “linear spatial summation” or “linear superposition”. Let us describe the
stimulus image, i.e. the spatial distribution of light intensity on the screen, by the
function I(x, y) which takes values between 0 (no light, black) and 1 (brightest pos-
sible light, white). A square light spot with intensity 1 covering the interval from
x = 0 to x = 1 and y = 0 to y = 1 can be described by the function

d(x, y) := 1 if 0 ≤ x < 1 and 0 ≤ y < 1, and 0 otherwise. (2.1)

We will call this function the pixel function in the sequel.
A spot of light appearing at location (xo , yo ) would then be given by

I(x, y) = d(x − xo, y − yo ). (2.2)

It is worthwhile verifying this equation with some numerical examples. Indeed, the
idea that functions can be shifted by subtracting appropriate constants in their argu-
ment is important for the understanding of this chapter.
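A small numerical check of this shifting idea (illustrative Python, not from the text):

    # Equation 2.2: subtracting (xo, yo) inside the argument shifts the
    # pixel function d to that location.
    def d(x, y):
        return 1.0 if (0 <= x < 1 and 0 <= y < 1) else 0.0

    xo, yo = 3.0, 2.0
    I = lambda x, y: d(x - xo, y - yo)   # spot of light at (xo, yo)

    print(I(3.5, 2.5))   # 1.0 -- inside the shifted unit square
    print(I(0.5, 0.5))   # 0.0 -- the spot is no longer at the origin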
Let us now assume that a light stimulus delivered at location (x1 , y1 ) elicits a
response φ1 of the neuron, and that a light stimulus delivered at location (x2 , y2 )
elicits a response φ2 :

I(x, y) = d(x − x1, y − y1) ⇒ e = φ1 (2.3)

I(x, y) = d(x − x2, y − y2) ⇒ e = φ2 (2.4)

where e denotes the excitation of the neuron, measured in spikes per second. In the
case of linear spatial summation, presenting both spots together will elicit the sum
of the individual responses:

I(x, y) = d(x − x1, y − y1 ) + d(x − x2, y − y2) ⇒ e = φ1 + φ2 . (2.5)

Similarly, if we increase the light intensity of the spot by a factor of λ ∈ IR, a linear
summation predicts a λ -fold output:

I(x, y) = λ d(x − x1 , y − y1) ⇒ e = λ φ1 . (2.6)
Fig. 2.1 Visual receptive fields defined as an experimental protocol. a. The visual field is
scanned line by line with a small spot of light. b. Simultaneously, neural activity is recorded
from a retinal ganglion cell. c., d. Each time the neuron fires, a point is plotted at the location
of the light spot at the time of firing. c. A neuron which is active if the light falls within
a central disc and is inhibited if the light falls in the outer annulus (on-center off-surround
organization). d. A neuron which is inhibited if the light falls within a central disc and is
activated if the light falls in the outer annulus (off center on surround organization). The
receptive field is the total area, from which the neuron’s activity can be modulated. The areas
shown in gray are rough estimates of the excitatory and inhibitory parts of the receptive fields.
e., f. Center-surround receptive field function φ (x, y) for the two neurons depicted in Figures
c, d. The gray areas c, d are delimited by contour lines of φ .
Linear spatial summation is a theoretical concept defined by the above two equa-
tions. It is useful to describe the behavior of neurons as long as the stimulus inten-
sities are not too large. I.e., if one spot of light already suffices to drive the cell into
saturation, additivity will not obtain. Also, since neither light intensities nor spike
rates can take negative values, Equation 2.6 becomes meaningless for negative λ .
However, even in clear non-linear cases, as will be discussed in later sections of
this chapter, linear descriptions are used as first approximations or components of a
more comprehensive model.

The Receptive Field Function
Let us now turn back to the problem of full, complex input images. We may think of
the image as a set of pixels each of which corresponds to a light spot d(x − xi , y − y j )
with intensity I_ij = I(x_i, y_j) and write down the trivial equation1

I(x, y) = ∑_{i=1}^{I} ∑_{j=1}^{J} I_ij d(x − x_i, y − y_j), (2.7)

where I and J are the total numbers of pixels in the x- and y-directions.
Assume now that we have measured the neuron’s response to an isolated light
spot delivered at each of the pixel locations xi , y j and denote the result by φ (xi , y j ).
Then, by the linear superposition principle, we can construct the response as the
sum of all responses to the individual pixels and obtain

e = ∑_{i=1}^{I} ∑_{j=1}^{J} I(x_i, y_j) φ(x_i, y_j). (2.8)

Finally, if we assume very large numbers of very small pixels, we may write

e = ∫_{−∞}^{∞} ∫_{−∞}^{∞} I(x′, y′) φ(x′, y′) dx′ dy′ (2.9)

where the integral is taken over the entire visual field.
Equation 2.9 is called the correlation of I with φ (or vice versa). The function
φ(x, y) is called the receptive field function or response profile of the neuron. For
each point (x, y) in the visual field, φ(x, y) is the response of the neuron, elicited
from this location. Equation 2.9 is a so-called improper integral which evaluates to
a number (e ∈ IR). It can be thought of as the volume under the two-dimensional
"surface" I(x′, y′)φ(x′, y′) where x′, y′ represent the visual field. The visual field is
thus modeled as an infinite plane over which the integral is taken. Since the receptive

1 The sum sign ∑ (capital Greek letter sigma) is an abbreviation defined as ∑_{i=1}^{I} z_i := z_1 + z_2 +
... + z_I where I ∈ IN is the number of components of the sum. The double sum is needed here to
cover all pixels of a rectangular grid.
field function can be assumed to be zero outside a finite "support" region (the receptive
field), the existence of the integral is not a problem.2 The name "correlation"
for Equation 2.9 is based on the idea that for each location (x′, y′) the two numbers
I(x′, y′) and φ(x′, y′) form a data pair. If we omit the subtraction of the means and
the normalization with the standard deviations, Equation 2.9 indeed describes the
statistical correlation between the variables I and φ.
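In discrete form, Equation 2.8 is a single sum of pixelwise products. The following sketch (with an illustrative difference-of-Gaussians profile, anticipating Section 2.2.1; constant pixel-area factors are ignored) computes the response of a center-surround neuron to two stimuli:

    # Discrete superposition sum, Equation 2.8 (illustrative values).
    import numpy as np

    x = np.arange(-5, 5.25, 0.25)
    X, Y = np.meshgrid(x, x)
    R2 = X**2 + Y**2
    phi = np.exp(-R2 / 2) - 0.5 * np.exp(-R2 / 8)   # center-surround profile

    spot = (R2 < 1).astype(float)                   # disc on the center
    annulus = ((R2 >= 4) & (R2 < 9)).astype(float)  # ring on the surround

    e_spot = np.sum(spot * phi)      # positive: excitatory center
    e_ring = np.sum(annulus * phi)   # negative: inhibitory surround
    print(e_spot, e_ring)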
For the receptive field function, further names are in use, including kernel and op-
erator. “Kernel” is a general term for a fixed factor (φ ) in an integral equation where
a second factor (I) is considered a variable input function. An operator is a mapping
from a set of input functions (images) to a set of output functions (pattern of neu-
ral activity as are studied below). Since all linear operators can be expressed as an
integral equation with a suitable kernel (Riesz representation theorem of functional
analysis), the term operator is sometimes also used for the kernel.
The transition from the spatially discrete formulation in Equation 2.8 to the continuous
formulation in Equation 2.9 can also be applied to Equation 2.7. In this case,
the pixel-function d(x, y) has to be replaced by the so-called δ- or Dirac3 impulse
which can be thought of as the limit of a sequence of pixel functions with decreasing
pixel size and increasing amplitude, such that the volume remains constantly one. It
is mathematically defined by the relation

∫_{−∞}^{∞} δ(x) f(x) dx = f(0) for all functions f. (2.10)

From this definition, it immediately follows

f(x) = ∫_{−∞}^{∞} δ(x − x′) f(x′) dx′, (2.11)

i.e. correlation with δ(x − x′) "cuts out" the function value at position x. In the
next section, we will introduce the notion of convolution. Eq. 2.11 then says that
convolution with the δ-impulse does not change the input function f; i.e., it is the
neutral element of the convolution operation.
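In the discrete case the role of the δ-impulse is played by a unit impulse; a one-line check (illustrative) confirms the neutral-element property:

    # Convolution with a discrete unit impulse leaves a signal unchanged.
    import numpy as np

    f = np.array([0.0, 1.0, 3.0, 2.0, 0.5])
    delta = np.array([1.0])                           # discrete unit impulse
    print(np.array_equal(np.convolve(f, delta), f))   # True: neutral element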
Strictly speaking, δ (x) is not a proper function, because it does not take a finite
value at x = 0. Mathematically rigorous definitions are provided in the theory of
functionals (functional analysis).
Figure 2.1e,f shows the function φ for typical retinal ganglion cells. It has cir-
cular symmetry and is divided into a center and a surround in which the function
φ takes different signs. Negative values of φ mean that the cell is inhibited when a
stimulus is delivered at that location. In measurements, inhibitory regions show up
by a reduction of the spontaneous activity (cf. Figure 2.1) or by interactions with
2 In functional analysis, all functions f(x) for which ∫_{−∞}^{∞} f²(x) dx exists (i.e., evaluates to a number)
are contained in the Hilbert space L². Since the visual field is actually not the infinite plane but
rather a finite subset of the sphere, all continuous functions defined on the visual field satisfy this
condition.
3 Paul A. M. Dirac (1902 – 1984). English physicist. Nobel Prize in Physics (with E. Schrödinger)
1933.
a second, excitatory stimulation. The on-center/off-surround profile shown in Figure
2.1e is also known as the "Mexican hat function". It is usually fitted by a difference
of two Gaussian4 functions, as explained in Section 2.2.1.
Visual receptive fields generally do not respond to homogeneous light distribu-
tions, i.e. stimuli of the form I(x, y) ≡ Io . In Equation 2.9, this amounts to the con-
straint

∫_{−∞}^{∞} ∫_{−∞}^{∞} φ(x, y) dx dy ≈ 0. (2.12)
This means that for most receptive field functions, the total weights of excitatory
and inhibitory regions cancel out.
The receptive field function in the correlation operation can also be considered
as a template for a particular sub-pattern or “image feature” for which the neuron
is a detector. In the case of an off-center isotropic neuron appearing in Figure 2.1f,
the neuron will be maximally responsive if the image depicts a black dot in front
of a bright background with the diameter of the dot matching the diameter of the
receptive field center. The functional description of the “optimal stimulus” (i.e. the
stimulus most strongly driving the neuron) looks just like the receptive field function
itself. The correlation operation is therefore also known as a “matched filter” for the
image feature described by the neuron’s receptive field function. Mathematically,
this is an instance of the "Cauchy-Schwarz inequality", stating that the correlation
equation is maximized if the two functions involved have the same shape. An
early account of this idea is given in the famous paper entitled “What the frog’s eye
tells the frog’s brain” (Lettvin et al. 1959) which describes retinal ganglion cells as
“bug detectors” signaling to the brain dark dots in front of bright backgrounds, i.e.
putative flies.

2.1.2 Lateral Inhibition: Convolution
So far, we have considered a single neuron together with its receptive field. We now
turn to the case where a sensory surface (e.g., the retina) projects to an entire layer of
neurons each of which has its own receptive field. Clearly, this is the case found in
many sensory systems, in particular if representations are organized as topological
maps.
Figure 2.2 shows a simple circuitry known as lateral inhibition. The sensory in-
put shown in the top part is propagated to a set of neurons in a retinotopic layer.
In addition, each connection line branches to make inhibitory inputs to the nearest
neighbors of each neuron. If a constant input is given to the network, direct excita-
tory influences and indirect inhibitory influences cancel and the output layer will be
silent. If, however, patterned input is given to the network, intensity edges will be
enhanced.
Assume that a point stimulus is delivered to one input location of the lateral
inhibition network of Figure 2.2. As a result, a distribution of excitation and inhibi-
tion will arise in the output layer which is positive for the neuron directly innervated

4 Carl Friedrich Gauß (1777 – 1855). German mathematician.
[Figure 2.2: a. a lateral-inhibition network mapping the input 1 1 0 0 0 1 0 0 to the
output 0.0 0.5 −0.5 0.0 −0.5 1.0 −0.5 0.0; b. the receptive field function φ(x); c. the
point-spread function ψ(x).]

Fig. 2.2 Lateral inhibition. a. Input-output relations for a simple example. Links ending in an
arrow are excitatory (weight +1); links ending in a dash symbolize inhibitory connections
(weight −0.5). The input shows a step edge (left) and a contour (right) which are enhanced
in the output. Constant input is attenuated. b. Convergence of activity to one output neuron.
The activity of this neuron is characterized by its receptive field function φ(x). c. Divergence
of activity from one input site. The resulting distribution of activity over the output layer
is the point-spread function ψ(x). The system is shift-invariant, resulting in a close relation
between point-spread function and receptive field function, i.e. ψ(x) = φ(−x).

from the input site and negative for its neighbors which receive the lateral, inhibitory
connections. In general, we will denote such distributions as e(x, y) where (x, y) is
the location of a neuron and e(x, y) its activity. The particular distribution resulting
from a point-stimulus is called the point spread function or point image; we will de-
note it by ψ (x, y). If we have a homogeneous layout of neurons each with the same
local connectivity pattern, as is the case in Figure 2.2, the point spread functions
for different stimulation sites will be identical up to a shift in position (“translation
invariance”). As in the previous section, we may write:
I(x, y) = d(x − x1, y − y1) ⇒ e(x, y) = ψ(x − x1, y − y1) (2.13)

I(x, y) = d(x − x2, y − y2) ⇒ e(x, y) = ψ(x − x2, y − y2), (2.14)

where ψ is the point spread function for a point stimulus delivered at position
(x, y) = (0, 0).
For general images composed of many point stimuli, we obtain:
e(x, y) = ∑_{i=1}^{I} ∑_{j=1}^{J} I(x_i, y_j) ψ(x − x_i, y − y_j), (2.15)

or, in the infinitesimal formulation

e(x, y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} I(x′, y′) ψ(x − x′, y − y′) dx′ dy′. (2.16)

Equation 2.16 is known as the convolution5 of I with ψ. We also write e = I ∗ ψ or
e(x, y) = (I ∗ ψ)(x, y). It is again an improper integral. A space-variant version of
Equation 2.15 was first introduced into neuroscience as a model of lateral inhibition
in the horseshoe crab (Limulus) eye and is therefore also known as the Hartline6-
Ratliff equation (Hartline & Ratliff 1958).
In a two-layer feed-forward network, convolution describes the divergence of
neural activity. Other examples include the blurring of a projected slide (I: slide;
ψ : blurring disk) or the superposition of sand hillocks formed by sand trickling
through holes in a board (I: the pattern of holes in the board; ψ : sand hillock formed
under one individual hole). Interestingly, convolution describes also the probability
density function of a sum of two random variables, which is obtained by convolving
the two individual density functions.
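The divergence picture is easy to reproduce numerically; the sketch below (using scipy's convolve2d; stimulus and point-spread function are made up) shows that two point stimuli produce two shifted, superimposed copies of ψ:

    # Equation 2.16 in discrete form: e = I * psi (illustrative values).
    import numpy as np
    from scipy.signal import convolve2d

    psi = np.array([[0., -0.5, 0.],        # tiny center-surround
                    [-0.5, 1.0, -0.5],     # point-spread function
                    [0., -0.5, 0.]])

    I = np.zeros((7, 7))
    I[2, 2] = 1.0                          # point stimulus 1
    I[4, 5] = 1.0                          # point stimulus 2

    e = convolve2d(I, psi, mode='same')    # two shifted copies of psi add up
    print(e)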
The variables in Equation 2.16 can be interpreted in the following way: Cortical
coordinates (coordinates on the output layer) are given by (x, y); (x′, y′) parameterize
the retina, or input layer, over which the integral is taken. ψ(x − x′, y − y′) then gives
the strength, or weight, with which a stimulus delivered at retinal position (x′, y′)
influences the output at (x, y). Mathematically, it is easy to show that convolution is
commutative, i.e. that we can write

∫_{−∞}^{∞} ∫_{−∞}^{∞} I(x′, y′) ψ(x − x′, y − y′) dx′ dy′
  = ∫_{−∞}^{∞} ∫_{−∞}^{∞} I(x − x″, y − y″) ψ(x″, y″) dx″ dy″. (2.17)

This result is achieved by the substitution x − x′ =: x″ and y − y′ =: y″. Now, the intuition
is slightly different: while (x, y) is still the position in the output layer, (x″, y″)
is the stimulus offset relative to the current receptive field center and the integral
is taken over the entire receptive field. The input function I has to be evaluated at
5 The German term “Faltung” is sometimes also found in English texts.
6 Haldan K. Hartline (1903 – 1983). American biologist. Nobel Prize in Physiology or Medicine
1967.
(x − x″, y − y″), i.e. the receptive field center in absolute coordinates ((x, y)) minus
the offset. In the end, this is just a difference in the parameterization of the lateral
influences, not a different operation. Generally, convolution operations are characterized
by the occurrence of the integration variable in both components, but with
opposite signs.
Lateral inhibition was first studied by Ernst Mach7 in relation to a perceptual phe-
nomenon now known as Mach bands. Continuous intensity changes are perceived
as roughly homogeneous areas whereas boundaries between constant intensity areas
and intensity ramps are perceived as bright or dark bands, depending on the direc-
tion of the kink between the plateau and the ramp. Mach suggested that the retina
calculates local second derivatives with respect to space, resulting in an excitation
pattern of the form

e(x, y) = c (∂²I/∂x² + ∂²I/∂y²). (2.18)
The term in brackets, i.e. the sum of second partial derivatives, is known as the
"Laplacian"8 of the function I(x, y) and closely related to the convolution with a center-surround
system. Indeed, the network depicted in Figure 2.2 can be considered
as calculating a spatially discretized approximation of a (one-dimensional) second
derivative. If the intensity function is equidistantly sampled to values I1, I2, I3, ..., the
local derivatives can be approximated as I′1 = I2 − I1 and I′2 = I3 − I2. The second
derivative is then approximated as

I″1 = I′2 − I′1 = I3 − 2I2 + I1 (2.19)

which is exactly the operation of Figure 2.2 up to a factor −1/2. The analogy between
the formulations of lateral inhibition by the convolution integral or the partial
differential equation (2.18) is reflected by modeling center-surround receptive fields
as Laplacians of Gaussians.
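The network of Figure 2.2a can thus be simulated as a one-dimensional convolution; in the sketch below (illustrative), the kernel (+1 center, −0.5 neighbors) reproduces the output values of the figure except at the borders, where zero padding is assumed:

    # Lateral inhibition as -1/2 times the discrete second derivative (Eq. 2.19).
    import numpy as np

    I = np.array([1., 1., 0., 0., 0., 1., 0., 0.])   # input of Figure 2.2a
    kernel = np.array([-0.5, 1.0, -0.5])

    e = np.convolve(I, kernel, mode='same')          # edge/contour enhancement
    print(e)   # matches Figure 2.2a except at the left border (zero padding)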

2.1.3 Correlation and Convolution
Let us now look at the receptive fields of the output neurons in the lateral inhibition
network. By inspection, we see that they are identical to the point-spread functions.
In order to compare the two types of functions, both of which allow us to describe the
system behavior, we need to extend Equation 2.9 to layers. We note that φ(x′, y′)
was the receptive field function of the sole neuron considered in the derivation of
Equation 2.9. Let us now assume that this neuron is located at (x, y) = (0, 0) in the
output layer. In translation invariant systems, the receptive fields of other neurons
in that layer will be shifted versions of the original neuron's receptive field. Specifically,
a neuron at location (x, y) will have a receptive field φ(x′ − x, y′ − y). We can
therefore generalize Equation 2.9 to layered systems:

7 Ernst Mach (1838 – 1916). Austrian physicist and philosopher.
8 Pierre-Simon Marquis de Laplace (1749 – 1827). French mathematician.
e(x, y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} I(x′, y′) φ(x′ − x, y′ − y) dx′ dy′. (2.20)

By comparing Equation 2.16 with Equation 2.20 for all possible images I, we find
that

φ(x′, y′) = ψ(−x′, −y′), (2.21)

i.e. the point-spread function and the receptive field function are mirrored versions
of each other. Note that this result holds only for translation-invariant systems. In
this case, the point-spread function describes the divergence in a network and the
receptive field function the convergence in the same network. Clearly, divergence
and convergence have the same origin, i.e. lateral connectivity (cf. Figure 2.2b,c).
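Equation 2.21 can be verified numerically: computing responses by correlating with φ gives the same result as convolving with ψ when φ is the mirrored version of ψ (an illustrative sketch with an arbitrary kernel):

    # phi(x) = psi(-x): correlation with phi equals convolution with psi.
    import numpy as np

    rng = np.random.default_rng(0)
    I = rng.random(50)
    psi = np.array([-0.5, 1.0, -0.5, 0.25])       # arbitrary point-spread fn
    phi = psi[::-1]                               # mirrored version

    e_conv = np.convolve(I, psi, mode='valid')    # divergence view (Eq. 2.16)
    e_corr = np.correlate(I, phi, mode='valid')   # convergence view (Eq. 2.20)
    print(np.allclose(e_conv, e_corr))            # True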
If translation-invariance does not obtain, i.e., if receptive fields of neighboring
neurons differ in systematic ways, point-spread function and receptive field function
will also differ systematically (see Mallot et al. 1990). Figure 2.3 shows a projec-
tion between two retinotopically organized neural layers with varying magnification
factor. If an input neuron is connected to many output neurons, point-spread func-
tions will be large and receptive fields will be small (upper part of Figure 2.3b).
Vice versa, if an input neuron projects only to a few output neurons, point-spread
functions will be small and receptive fields large. Similarly, large visual areas such
as V1 tend to have small receptive fields and large point-spread functions, whereas
small areas such as MT tend to have large receptive fields and small point-spread
functions.
Fig. 2.3 Receptive fields (shaded) and point images (hatched) in space-invariant (left) and
space-variant (right) projections between a sensory surface S and a central representation C.
The receptive field of a unit yo ∈ C is a part of the sensory surface. The point image of a point
xo ∈ S is a part of the central representation. In the space-invariant case (left; convolution),
receptive field and point image do not change over space. In the space-variant case (right)
magnified regions (upper part) have large point images and small receptive fields, whereas
reduced regions (lower part) have small point images and large receptive fields.
Fig. 2.4 Interpretation of the time variables in Equation 2.22: the stimulus event I(t − t′)
precedes the response e(t) by the delay t′. Note that the naming of time variables can be
changed by the transformation τ := t − t′, t − τ = t′. In the equation, the roles of I and w
will be exchanged, demonstrating the commutativity of convolution.

2.1.4 Spatio-Temporal Summation
Neurons collect activity not only over space but also over time. To understand how
this is modeled in linear summation, we first need to consider spatio-temporal stim-
uli, i.e. movies. Such stimuli specify an intensity value for each image point and
each instant in time and may therefore be represented by a three-dimensional func-
tion I(x, y,t). The activity of the neuron at time t will in principle depend on the
complete movie up to time t. Just as we divided image space into pixels, we may
now conceptualize time as being divided into discrete time steps, as is indeed the
case in movies or video clips. The stimulus is now a large, three-dimensional ar-
rangement of individual events, i.e. light flashes, each with the size of one pixel
and lasting one time step. Let t′ denote the time that passed between the instant of
recording from the neuron, t, and the spatio-temporal stimulus event whose influence
on the neuron is considered. Clearly, this event took place at t − t′.
In this situation, spatio-temporal summation is characterized by a spatio-temporal
weighting function or kernel w(x, y, t′) specifying for each position (x, y) and each
delay t′ the sign and strength with which this particular stimulus event influences the
output (cf. Figure 2.4). As in spatial summation, we assume translation invariance,
which in the temporal domain is also called stationarity. It means that the system
will produce the same dynamic response every time the same stimulus is presented.
In neural systems, deviations from stationarity may be due to fast changes such as
fatigue and adaptation, or to slower processes such as plasticity, growth, or ageing.
In the stationary case, the summation of the influences reads:

e(t) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} ∫_{0}^{∞} I(x, y, t − t′) w(x, y, t′) dt′ dy dx. (2.22)
The function w(x, y,t) is called the spatio-temporal receptive field profile.
Note that space and time are treated differently in this equation. Temporally, we
have a convolution, reflecting the fact that the neuron will be influenced by the tem-
poral evolution of the stimulus. In space, we simply sum over the entire receptive
field. A spatial convolution would be required only if a uniform layer of neurons
with identical, but space-shifted receptive fields were to be modeled. The inner in-
tegral in Equation 2.22 is taken over positive values of t′ only. This is in contrast
to spatial convolution, where x and y varied between +∞ and −∞. The reason for
this restriction is of course that only stimuli occurring before the time of measurement
can influence the excitation. This constraint is called "causality". It can also be
accounted for by setting the values of w(x, y, t′) to 0 for all values t′ < 0, meaning
that an input event cannot have effects in the past. With this convention, the integral
may be taken from −∞ to ∞.
An important special case of Equation 2.22 is obtained if we consider a purely
spatial summation, followed by a temporal summation process applied only to the
result of the spatial process. In this case, the spatio-temporal receptive field function
can be split up into two parts, a spatial profile φ (x, y) and a temporal weighting
function g(t):
w(x, y,t) = φ (x, y)g(t). (2.23)
The contribution of a stimulus delivered at time t − t′ to the neuron's activity at time t
is given by a function g(t′) where t′ specifies how much time has passed between the
delivery of the stimulus and the instant t at which the response is measured. Generally,
g(t′) will be small for large delays t′ and maximal for small or intermediate values
of t′. Spatio-temporal kernels that can be split up in this way into a spatial and a
temporal factor are called "separable".
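For a separable kernel, Equation 2.23 amounts to an outer product of a spatial and a temporal array, as in the following sketch (values illustrative; the temporal factor anticipates Equation 2.29):

    # Separable kernel w(x, y, t) = phi(x, y) * g(t), Equation 2.23.
    import numpy as np

    x = np.linspace(-4, 4, 33)
    X, Y = np.meshgrid(x, x)
    phi = np.exp(-(X**2 + Y**2) / 2) - 0.4 * np.exp(-(X**2 + Y**2) / 8)

    tau = 10.0
    t = np.linspace(0, 60, 61)
    g = (t / tau) * np.exp(-t / tau)          # peaks at t = tau

    w = phi[:, :, None] * g[None, None, :]    # outer product over time
    print(w.shape)                            # (33, 33, 61)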

2.1.5 Peri-Stimulus Time Histogram (PSTH) and Tuning Curves
Spatio-temporal receptive field functions provide complete descriptions of a neu-
ron’s behavior, at least if linearity and translation-invariance in time is assumed.
In practical experiments, however, other descriptors such as the peri-stimulus time
histogram and the tuning curve are more commonly used. In this section, we will
briefly discuss how these measurements relate to the receptive field function, again
for the linear case.
The peri-stimulus time histogram is the average time-dependent rate of action
potentials (or spike rate) measured during a stimulus presentation cycle. Let the
single spike-train measured during the i-th stimulus presentation be (t_{i1}, t_{i2}, ..., t_{i n_i})
where n_i is the total number of spikes in that spike-train and t_{ij} denotes the time
when the j-th spike in the i-th trial occurred. Then, the spike rate is a piecewise
constant function of time; between two subsequent spikes, it evaluates to the inverse
of the inter-spike interval:

p_i(t) := 1 / (t_{i,j+1} − t_{ij}) for t_{ij} ≤ t < t_{i,j+1}, j = 0, ..., n_i − 1. (2.24)

The PSTH can be calculated as the pointwise average of this function over all m
stimulus presentation cycles,

PSTH(t) = (1/m) ∑_{i=1}^{m} p_i(t). (2.25)

Assume now that the neuron is stimulated by a spatio-temporal pattern I(x, y,t) last-
ing for some duration T . After each presentation, an inter-stimulus-interval (ISI)
TISI is included during which the presentation screen is left blank. We need to as-
sume that the ISI is long enough for the response to decay to zero, i.e. w(x, y,t) = 0
for t > TISI . In this situation, the time course of the neuronal response e(t) as cal-
culated from Equation 2.22 is the linear prediction of the PSTH.
The PSTH of a given neuron will of course be different for different stimuli. In
many experiments, the stimuli used can be thought of as a parameterized family
of functions, with a parameter specifying the orientation of a bar or grating, the
speed of a moving bar, the length of a bar, etc. For each value of the parameter, a
PSTH e_p(t) results, where p denotes the parameter. From this PSTH, we calculate
a characteristic number such as the total number of spikes, the maximal spike rate,
etc. The dependence of this number on the parameter is known as the tuning curve
of the neuron for the stimulus parameter considered. For the maximal spike rate, we
obtain
ρ(p) := max_t e_p(t), (2.26)

where the Greek letter ρ (rho) denotes the tuning curve. Alternatively, tuning may
be defined as the total number of spikes elicited with each parameter setting,

ρ(p) := ∫_0^{t_max} e_p(t) dt. (2.27)

Clearly, the peri-stimulus time histogram and the tuning curve are measuring protocols
which remain useful and well-defined even if the linear model of the receptive field
presented here fails.
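As an illustration of Equations 2.24 – 2.26, the following sketch computes the piecewise-constant rate, the PSTH, and a maximum-rate tuning value from made-up spike times (intervals between recorded spikes only):

    # PSTH from spike trains (Eqs. 2.24, 2.25); spike times are invented.
    import numpy as np

    trials = [np.array([0.012, 0.030, 0.041, 0.055]),   # trial 1 (seconds)
              np.array([0.015, 0.028, 0.044, 0.060])]   # trial 2

    t_grid = np.linspace(0, 0.08, 200)

    def rate(spikes, t):
        # inverse inter-spike interval at time t; 0 outside the spiking epoch
        r = np.zeros_like(t)
        for t0, t1 in zip(spikes[:-1], spikes[1:]):
            r[(t >= t0) & (t < t1)] = 1.0 / (t1 - t0)   # Eq. 2.24
        return r

    psth = np.mean([rate(s, t_grid) for s in trials], axis=0)   # Eq. 2.25
    print(psth.max())   # maximal rate, one point of a tuning curve (Eq. 2.26)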

2.2 Functional Descriptions of Receptive Fields
2.2.1 Isotropic Profiles: Gaussians
So far, the examples of receptive field functions were isotropic, or circularly symmetric,
functions, as are found in retinal ganglion cells and the lateral geniculate nucleus
(LGN). They are usually modeled using the well-known Gaussian function

φ(x, y) = 1/(2πσ²) exp(−(x² + y²)/(2σ²)). (2.28)

The factor 1/(2πσ²) has been chosen such that the integral over the entire function
equals unity. The only free variable is the scale σ which determines the width of the
bell-shaped surface (see Figure 2.7b). Receptive fields differing in scale will react
to stimulus features of different size or granularity. Contour lines of the Gaussian
are curves satisfying the relation (x² + y²)/2σ² = const, i.e. they are circles. In particular,
the contour line at elevation e^{−1/2} φ(0, 0) has radius σ. In one-dimensional
versions, i.e. sections along the coordinate axes, the arguments ±σ mark the inflection
points of the bell-shaped curve.
Adding a temporal component to the receptive field function, we choose the function
g(t) from Equation 2.23 to be
Fig. 2.5 Non-separable, spatio-temporal weighting function of the center-surround type
(Equation 2.30, with c1 = 0.1, c2 = 0.05, τ1 = 10, τ2 = 20, σ1 = 1, σ2 = 3). a. Temporal
development in the center (w(0, 0, t)). b. – g. Spatial weighting function w(x, y, to) for
to = 0, 3, 6, 12, 24, 48.

g(t) = (t/τ) exp(−t/τ). (2.29)
Here, τ is called a time constant. The function is 0 for t = 0, rises to its maximum
at t = τ and declines asymptotically to zero for larger t. If the neuron reacts fast,
it will have a short time constant, while slow responses may be modeled by larger
time constants.
The standard model of the center-surround type receptive field of retinal ganglion
cells can now be formulated:

w(x, y, t) = c1 t e^{−t/τ1} exp(−(x² + y²)/(2σ1²)) − c2 t e^{−t/τ2} exp(−(x² + y²)/(2σ2²)). (2.30)

It is a difference of two Gaussians with unequal width, where each Gaussian is multiplied
with a temporal summation function. For an on-center, off-surround ganglion
cell, we may choose an excitatory mechanism with small width σ1 and small time
constant τ1 and an inhibitory mechanism with larger width σ2 > σ1 and larger time
constant τ2 > τ1. A function with these specifications is illustrated in Figure 2.5.
Leaving the temporal component out, Equation 2.30 reduces to a so-called “differ-
ence of Gaussians” (DoG) or “Mexican hat” function plotted in Figure 2.1e,f.
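Equation 2.30 is straightforward to evaluate numerically; the sketch below uses the parameter values quoted for Figure 2.5 and confirms that the center response changes sign over time:

    # Non-separable center-surround kernel, Equation 2.30 (Fig. 2.5 values).
    import numpy as np

    c1, c2, tau1, tau2, sig1, sig2 = 0.1, 0.05, 10.0, 20.0, 1.0, 3.0
    x = np.linspace(-4, 4, 41)
    X, Y = np.meshgrid(x, x)
    R2 = X**2 + Y**2

    def w(t):
        return (c1 * t * np.exp(-t / tau1) * np.exp(-R2 / (2 * sig1**2))
              - c2 * t * np.exp(-t / tau2) * np.exp(-R2 / (2 * sig2**2)))

    # Early on the fast excitatory center dominates; later the slow surround wins.
    print(w(6)[20, 20] > 0, w(48)[20, 20] < 0)   # center value at to = 6 vs 48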
[Figure 2.6: four panels with (σ = 0.5, ω = π), (σ = 1.0, ω = π), (σ = 0.5, ω = 2π), and
(σ = 1.0, ω = 2π).]

Fig. 2.6 One-dimensional Gabor functions (Equations 2.31, 2.32) for various choices of scale
σ and frequency ω. In each plot, the even (left-right mirror symmetric) heavy line is the
cosinusoidal Gabor function, the odd (point-symmetric) heavy line is the sinusoidal Gabor
function. The dashed lines show the enveloping Gaussians. Note that the top right and bottom
left plots are identical up to a horizontal compression. This reflects the fact that in these cases
the product σω is the same. It is proportional to the number of "side lobes" included in the
function.

Note that Equation 2.30 is also an example of a non-separable spatio-temporal
function, as long as τ1 ≠ τ2. Other types of spatio-temporal non-separability include
spatio-temporal orientation (i.e. receptive fields shifting over time) or changes in
preferred orientation.

2.2.2 Orientation: Gabor Functions
In the visual cortex, most receptive fields are oriented, i.e., their response to elon-
gated stimuli (light bars) changes if these bars are presented vertically, horizontally,
or at any other orientation.
Mathematically, there are different ways of generating oriented receptive field
functions that can be used to model such behavior. The function type which is now
almost exclusively used is the so called Gabor9 function, generated by multiply-
ing a (one- or two-dimensional) sinusoidal with a Gaussian envelope (see Pollen &
Ronner 1983, Jones & Palmer 1987). In one dimension, we have:
9 Dennis Gabor (1900 – 1979). Hungarian physicist (holography). Nobel Prize in Physics 1971.
Fig. 2.7 a. Plane wave f(x, y) = cos(ωx x + ωy y). The vectors on top show the coordinate
axes (x, y) and the wave direction ω = (ωx, ωy). The unlabeled vector is the orthogonal
direction (−ωy, ωx) along which f is constant. b. Gaussian function ϕ(x, y) = exp{−(x² +
y²)/2σ²}. c. Cosine Gabor function obtained by multiplying f and ϕ. d. Same as c, shown
as contour plot. e. Sine Gabor function obtained by phase-shifting f by a quarter cycle and
multiplying it by ϕ. f. Same as e, shown as contour plot.

gc(x) := cos(ωx) exp(−x²/(2σ²)) (2.31)

gs(x) := sin(ωx) exp(−x²/(2σ²)). (2.32)

The cosinusoidal Gabor function is “even”, i.e., it satisfies the condition that gc (x) ≡
gc (−x); the sinusoidal Gabor function is odd, gs (x) ≡ −gs (−x) (cf. Figure 2.6). As
before, σ determines the width of the Gaussian window while ω is the frequency
of the underlying sinusoidal; since gc and gs are functions of a spatial variable, ω
is called a spatial frequency. Gabor functions are also sometimes called wavelets,
because of their local wave-shaped look.
Orientation is added when the Gabor function is extended to two dimensions. To
see this, we consider first the “plane wave”

f (x, y) = cos(ωx x + ωy y) (2.33)

depicted in Figure 2.7a. It describes a corrugated surface similar to a wash board.
Sections along the direction (−ωy, ωx), i.e. along a "wave front", are constant.
All other sections are sinusoidals with various frequencies; in particular, the frequencies
of sections along the coordinate axes are ωx and ωy, respectively. The direction
with the highest frequency is (ωx, ωy), i.e. orthogonal to the wave front. The spatial
frequency of the two-dimensional function f(x, y) is the vector (ωx, ωy). As in the
one-dimensional case, the Gabor function is obtained by multiplying the sinusoidal
with a Gaussian, see Figure 2.7b. If the Gaussian is centered on a wave peak or trough,
the result will be a symmetric, cosine Gabor function (Figure 2.7c,d); if the Gaussian
is centered on a flank, an odd, or sine, Gabor function will result (Figure 2.7e,f). In
principle, Gabor functions with arbitrary phase relation can be generated by shifting
the Gaussian with respect to the plane wave. The equations read:

gc(x, y) := cos(ωx x + ωy y) exp(−(x² + y²)/(2σ²)) (2.34)

gs(x, y) := sin(ωx x + ωy y) exp(−(x² + y²)/(2σ²)). (2.35)

Gabor functions are characterized by the following parameters which model the
major specificities found in neurons in the primary visual cortex (see the sketch after
this list):
1. In Equations 2.34, 2.35, the Gabor functions are localized at position (x, y) = (0, 0).
Shift terms (xo, yo) can be added to move the function to arbitrary positions.
2. The scale σ determines the overall width of the receptive field.
3. The preferred orientation is given by the ratio ωy/ωx. If expressed as an angle
from the x-axis, orientation becomes tan⁻¹(ωx/ωy).
4. The preferred spatial frequency is the frequency of a grating at any orientation
driving the cell most strongly. This frequency is determined by √(ωx² + ωy²). Spatial
frequency preference is sometimes referred to as "localization in the spatial
frequency domain".
5. The number of side lobes is determined by the product of the overall spatial
frequency and scale.
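The sketch below (illustrative parameter values) constructs the even and odd Gabor functions of Equations 2.34, 2.35 and evaluates the parameters listed above:

    # Two-dimensional Gabor receptive fields (Eqs. 2.34, 2.35).
    import numpy as np

    sigma = 1.0
    wx, wy = 2.0, 1.0                 # spatial frequency vector (omega_x, omega_y)

    x = np.linspace(-4, 4, 81)
    X, Y = np.meshgrid(x, x)
    envelope = np.exp(-(X**2 + Y**2) / (2 * sigma**2))
    gc = np.cos(wx * X + wy * Y) * envelope     # even (cosine) Gabor
    gs = np.sin(wx * X + wy * Y) * envelope     # odd (sine) Gabor

    freq = np.hypot(wx, wy)                     # preferred spatial frequency (item 4)
    angle = np.degrees(np.arctan2(wx, wy))      # tan^-1(wx/wy), as in item 3
    print(freq, angle, gc.shape, gs.shape)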

2.2.3 Spatio-Temporal Gabor Functions
In the discussion of motion selectivity, we will encounter spatio-temporal Gabor
functions. In these, the plane wave of Figure 2.7a is replaced by a propagating wave,

gc(x, y, t) = cos(ωx x + ωy y − νt). (2.36)

Here, ν is a temporal frequency (dimension 1/sec) and the speed of movement
in the (ωx, ωy)-direction is ν / √(ωx² + ωy²).
The Gaussian window function also has to be replaced by some spatio-temporal
window function. Clearly, in time, this function can only extend into the past, since
the future stimuli are unknown (“causality”). A sensible choice for the spatio-
temporal window function uses a logarithmically distorted time axis. In order to
apply the scale variable σ to both the space and the time domain, we need to make
the space and time variables dimensionless, i.e. divide them by their respective units.
40 2 Receptive Fields and the Specificity of Neuronal Firing

We call the new variables ξ , ζ , θ (read xi, zeta, and theta) and obtain (Koenderink
1988):

w(ξ, ζ, θ) = exp(−[(ξ − ξo)² + (ζ − ζo)² + (ln(−θ) − ln(−θo))²] / (2σ²)) for θ, θo < 0. (2.37)
θo < 0 is the moment in the past, onto which the temporal window is centered. The
temporal window will be asymmetric, compressed in the forward direction (between
θo and 0) and expanded in the backward direction. Its overall size will shrink the
closer θo is to 0, i.e., the present moment.
The drifting grating of Equation 2.36 is then multiplied with the spatio-temporal
window function to generate a spatio-temporal Gabor function.
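A sketch of this construction (illustrative values; the space and time variables are assumed to be already dimensionless, as described above):

    # Drifting wave (Eq. 2.36) under the log-distorted causal window (Eq. 2.37).
    import numpy as np

    wx, wy, nu = 2.0, 0.0, 1.0            # wave drifts along x
    sigma, xi0, zeta0, th0 = 1.0, 0.0, 0.0, -5.0

    xi = np.linspace(-4, 4, 41)
    theta = np.linspace(-20, -0.1, 80)    # past times only: theta < 0 (causality)
    XI, ZETA, TH = np.meshgrid(xi, xi, theta, indexing='ij')

    window = np.exp(-((XI - xi0)**2 + (ZETA - zeta0)**2
                      + (np.log(-TH) - np.log(-th0))**2) / (2 * sigma**2))
    g = np.cos(wx * XI + wy * ZETA - nu * TH) * window
    print(g.shape)                        # (41, 41, 80)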

2.2.4 Why Gaussians?
Why are Gaussians and Gabor functions used to describe receptive fields? In princi-
ple, the choice of a particular mathematical function as a model of a receptive field
will be based on the ability of the function to fit the receptive field profile. Gaus-
sians and Gabor functions, however, while well suited to fit experimental data, are
also chosen for mathematical reasons. The Gaussian is special in that convolving a
Gaussian with another Gaussian yields again a Gaussian (“central limit theorem”).
Also, the Fourier transform of a Gaussian (see below) is again a Gaussian. The Ga-
bor function is optimally localized both in space and in spatial frequency. Indeed,
in the 1970s, neuroscientists controversially discussed the question whether cortical
neurons are tuned to (localized in) space or spatial frequency, assuming that both
ideas were mutually exclusive. The introduction of the Gabor function settled this
question. The Fourier transform of the Gabor function is a Gaussian centered at a
non-zero frequency.

2.3 Non-linearities in Receptive Fields
Linear neurons are a theoretical abstraction. They are special in that they are com-
pletely defined by one spatio-temporal receptive field function which predicts the
neuron’s response to all possible stimuli by means of convolution. In contrast, non-
linearity is not a well-defined concept in itself, but simply the absence of linearity,
which can occur in many ways. Before discussing some important deviations from
linearity, we start with a general definition.

2.3.1 Linearity Defined: The Superposition Principle
In the previous sections, we have modeled simple spatial receptive fields by a
weighting function φ . A stimulus occurring at position (x, y) is multiplied by the lo-
cal weight φ (x, y) and the weighted contribution is added to the neuron’s response.
More generally, we will now introduce the notation
e = Φ (I), (2.38)

where I = I(x, y) is the stimulus (an image) and Φ is a general mapping from the
set of all images to the set of possible excitations e of the neuron. In mathematics,
a mapping from a function set into a number set is called a functional, the corre-
sponding discipline of mathematics is called functional analysis10. A functional Φ
is said to be linear if the following two requirements are satisfied (see also Equations
2.5, 2.6):
1. The response to the sum of two stimuli (I(x, y), J(x, y)) is the sum of the individual
responses:

Φ(I + J) = Φ(I) + Φ(J). (2.39)
Inserting from Equation 2.9, we obtain

Φ(I + J) = ∫∫ φ(x, y) (I(x, y) + J(x, y)) dx dy (2.40)
         = ∫∫ φ(x, y) I(x, y) dx dy + ∫∫ φ(x, y) J(x, y) dx dy
         = Φ(I) + Φ(J),

i.e. this condition is obtained from the distributive law and the linearity of the
integral.
2. The response to a multiple of the stimulus is the multiple of the response:

Φ(λI) = λ Φ(I) for all λ ∈ IR. (2.41)

Again, we verify this condition from Equation 2.9:

Φ(λI) = ∫∫ φ(x, y) (λ I(x, y)) dx dy = λ ∫∫ φ(x, y) I(x, y) dx dy = λ Φ(I), (2.42)

i.e. this condition holds due to the associative law of multiplication and the linearity
of the integral.
These two properties taken together are known as the superposition principle. In
words, it states that in linear systems, responses to individual stimuli superimpose,
or add up, if the stimuli are presented together.
If I is taken to be the image intensity, i.e. the physical power (irradiance) im-
pinging per unit area on the retinal receptors, the response of a neuron can never be
strictly linear. Consider for example a stimulus leading to some excitation e. Linear-
ity requires that the neuron would react to the same stimulus, amplified by a factor of

10 If e is also considered a function, as was the case in the discussion of lateral inhibition, Φ in Equation 2.38 becomes a mapping from a set of functions (input images) into another set of functions (patterns of neural excitation). Such mappings are called operators. The definition of linearity extends to operators in a straightforward way.

[Block diagram: input I(x, y, t) → linear filter w(x, y, t) → potential u(t) → static non-linearity e = f(u) → output]

Fig. 2.8 An important class of non-linear systems can be described as a cascade of a linear system followed by a static non-linearity. In neuroscience, L–NL cascades are considered a coarse model of dendritic summation followed by spike initiation at the axon hillock.

λ, with a λ-fold activity λe. Clearly, this will work only for small positive values of λ, since the activity of the neuron is limited and cannot become negative. Linearity, therefore, is often a good model as long as small stimuli (or small deviations from some standard stimulus) are considered, but usually fails for large signal amplitudes. Even then, however, the linear case serves as a first approximation to the problem.
Note that non-linearity and translation-invariance (see Figure 2.3) are indepen-
dent concepts. A system may be linear but not translation invariant, or vice versa. In
either case, however, it cannot be described by convolution. Indeed, it can be proven
that all linear, translation-invariant systems are convolution systems.
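The superposition principle can be checked numerically. The following sketch (a discrete analogue with made-up filter weights, not the book's notation) verifies both requirements for a convolution system and shows how they fail once a static non-linearity is appended:

```python
import numpy as np

# Sketch: numerical test of the superposition principle for a convolution
# "system" Phi with made-up weights, and its failure after rectification.
rng = np.random.default_rng(0)
phi = np.array([1.0, -2.0, 1.0])                 # hypothetical weighting function
Phi = lambda I: np.convolve(I, phi, mode="same") # linear, translation-invariant

I, J, lam = rng.normal(size=50), rng.normal(size=50), 3.0
print(np.allclose(Phi(I + J), Phi(I) + Phi(J)))  # additivity, Eq. 2.39 -> True
print(np.allclose(Phi(lam * I), lam * Phi(I)))   # homogeneity, Eq. 2.41 -> True

f = lambda u: np.maximum(u, 0.0)                 # half-wave rectification
print(np.allclose(f(Phi(I + J)), f(Phi(I)) + f(Phi(J))))  # -> False in general
```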

2.3.2 Static Non-linearity


The simplest type of non-linearity is described by terms like thresholding, saturation, or compression and can be modeled as a sequence, or cascade, of a linear system and a so-called static non-linearity, or transfer function. In neural network models, neurons are often considered as linear–non-linear (L–NL) cascades, where the (approximately) linear part is dendritic summation, whereas spike generation at the axon hillock is non-linear but depends only on the instantaneous value of the generator potential. While the potential can be below or above resting potential (hyper- vs. depolarized soma), spike rates as described in Equation 2.24 cannot be negative. The static non-linearity will therefore be a mapping from the real numbers IR (the potentials) to the non-negative reals IR₀⁺. In summary, the cascade system can be written as

e(t) = f(u(t)) (2.43)

u(t) = ∫∫∫ I(x, y, t − t′) w(x, y, t′) dx dy dt′ (2.44)

(cf. Figure 2.8).


The function f is called a static non-linearity because it acts on individual points and instants of u(x, y, t), irrespective of the values at adjacent points in space or instants in time. Typical examples of static non-linearities are shown in Figure 2.9. Half-wave rectification simply sets negative values to zero:


Fig. 2.9 Static non-linearities applied to a sinusoidal. a. Sinusoidal used as input in all cases. b. Half-wave rectification. The boxed plot shows the static non-linearity f(u), the right part the effect on the sinusoidal input, i.e. f(sin t). Half-wave rectification deletes all negative values and passes the positive values unchanged. c. Binary switching function (Heaviside function) with threshold θ. d. Saturation function with half-saturation value uo. e. Sigmoidal with slope parameter λ. The value f(0) = 1/2 is the spontaneous activity of a neuron. f. The squaring non-linearity doubles the frequency of a sinusoidal input (since sin²t = 0.5 − 0.5 cos 2t). It plays an important role in so-called energy models of cortical complex cells (see below).


f₁(u) := { 0 if u ≤ 0, u if u > 0 }; (2.45)

it appears in Figure 2.9b.



The binary switching function (Heaviside function¹¹) is defined as

f₂(u) := { 0 if u ≤ 0, 1 if u > 0 }. (2.46)

Figure 2.9c shows the function f2 (u − θ ) where θ is the threshold. Note that the
threshold is modeled by a subtraction in the argument of f . This can be done with
all static non-linearities. In the case of a sinusoidal input, the threshold modulates
the ’duty-cycle’ of the binary output, i.e. the relative length of zero and non-zero
sections.
Saturation is often modeled by the equation

f₃(u) := { 0 if u ≤ 0, u/(u + uo) if u > 0 } (2.47)

(Figure 2.9d). The constant uo determines the slope of the function; it is the input value at which f₃ reaches the output value 0.5 (half-saturation). Equation 2.47 has been used to model the non-linearity of photoreceptors. In this case, uo can be used to adjust the curve to the state of light adaptation (“Naka-Rushton non-linearity”).
The static non-linearity used most frequently in neural network modeling is the sigmoidal (Figure 2.9e). It can be formalized in various ways, for example:

f₄(u) = e^{λu} / (e^{λu} + e^{−λu}), λ > 0. (2.48)

The parameter λ controls the slope of the sigmoidal at u = 0. The value f₄(0) models the spontaneous activity of the neuron. In the formulation given, f₄(0) = 1/2 for all λ. If other spontaneous activities are desired, the function can be shifted along the u-axis.
At first glance, the squaring non-linearity

f₅(u) = u² (2.49)

does not seem particularly plausible as a model of neural transmission. It turns out, however, to be very useful in the so-called energy models of complex cells which will be discussed below. There, it can be thought of as a rectifying square (i.e., f(u) = 0 for u < 0 and f(u) = u² for u ≥ 0) applied separately to an on and an off channel processing the same signal. The sum of the two outputs can equivalently be calculated by applying the two-sided squaring function to the linearly filtered input.
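The five static non-linearities of Figure 2.9 are easily written down in code. In the following sketch, the threshold, half-saturation, and slope values are arbitrary choices for illustration:

```python
import numpy as np

# Sketch of the static non-linearities of Figure 2.9 (parameters theta, u0
# and lam are arbitrary illustration values).
f1 = lambda u: np.maximum(u, 0.0)                         # half-wave rectification
f2 = lambda u, theta=0.3: (u > theta).astype(float)       # Heaviside with threshold
f3 = lambda u, u0=0.5: np.maximum(u, 0) / (np.maximum(u, 0) + u0)  # saturation
f4 = lambda u, lam=2.0: np.exp(lam * u) / (np.exp(lam * u) + np.exp(-lam * u))
f5 = lambda u: u**2                                       # squaring (energy models)

t = np.linspace(0, 4 * np.pi, 1000)
u = np.sin(t)                                  # sinusoidal generator potential
for name, f in [("f1", f1), ("f2", f2), ("f3", f3), ("f4", f4), ("f5", f5)]:
    e = f(u)
    print(f"{name}: output range [{e.min():.2f}, {e.max():.2f}]")  # all e >= 0
```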

2.3.3 Non-linearity as Interaction: Volterra Kernels


In convolution, a stimulus can be thought to be decomposed into a set of small,
instantaneous stimuli (“pixels” and “pulses”) each of which elicits a response

11 Oliver Heaviside (1850 – 1924). British mathematician and physicist.



described by the spatio-temporal weighting function w. By the superposition prin-


ciple, the total response is the sum of the effects of all individual pulses.
One interesting type of non-linearity arises from extending the superposition principle by assuming that there is an interaction between “pulses” occurring at different points and instants, and that the responses to such pairs of pulses superimpose. To keep the notation simple, we consider a purely temporal system, i.e. spatial summation is
assumed to be absent. In this case, the cell might be driven by the co-occurrence of two stimuli at times t′ and t″, an example of which is given in Figure 2.13a below. Let us denote the contribution of this stimulus pair to the cell's activity at time 0 by K(−t′, −t″). Assuming further that the contributions from all such stimulus pairs superimpose, we get

e(t) = ∫∫ K(t − t′, t − t″) I(t′) I(t″) dt′ dt″. (2.50)
The function K in Equation 2.50 is called a Volterra¹² kernel of second order; sometimes, the term “2-way interaction kernel” is also used. It describes a general multiplicative non-linearity. For example, if K equals the Dirac impulse δ(t′ − t″), e becomes the integral over the squared input function, i.e. the total signal power presented. Higher order interactions can be modeled analogously, i.e. with an n-dimensional kernel K and the product of the stimulus values at n different times (n-way interactions). The first-order Volterra equation is just (linear) convolution.
General non-linearities can be modeled by sums of Volterra-integrals of increas-
ing order (Volterra series). This approach has in fact been used to fit neurophysio-
logical data from complex receptive fields. The advantage of this method is that it
gives a general means for the identification of non-linearities. The disadvantage lies
in its huge number of unknown variables. For a spatio-temporal Volterra kernel of
order n, a function of 3n variables has to be measured.
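In discrete time, Equation 2.50 becomes a double sum over past input values. A minimal sketch with a made-up interaction kernel (memory length and kernel shape are arbitrary assumptions):

```python
import numpy as np

# Sketch: discrete second-order Volterra response (Equation 2.50),
# e[t] = sum_{i,j} K[i,j] * I[t-i] * I[t-j], with a made-up 5x5 kernel.
rng = np.random.default_rng(1)
I = rng.normal(size=200)                               # input signal
m = 5                                                  # kernel memory (illustrative)
K = np.outer(np.exp(-np.arange(m)), np.exp(-np.arange(m)))

e = np.zeros(I.size)
for t in range(m - 1, I.size):
    past = I[t - m + 1 : t + 1][::-1]                  # past[i] = I[t - i]
    e[t] = past @ K @ past                             # sum of 2-way interactions
print(e[m - 1 : m + 5].round(3))
```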

2.3.4 Energy-Type Non-linearity


Figure 2.10 shows four edges with the same orientation, but differing in two pa-
rameters called polarity and phase. Figure 2.10a,b shows “step edges” where the
intensity function is anti-symmetric (odd) with respect to the location of the edge.
Figure 2.10c,d shows “contour edges” whose intensity function is symmetric (even)
with respect to edge location. In analogy to the difference between the symmetric
cosine function and the anti-symmetric sine-function, the difference between step
and contour edges is called a phase difference. Optimal filters for step edges are
odd, i.e., sine-Gabor functions whereas optimal filters for contour edges are even,
i.e., cosine-Gabor functions. For each phase, the edges depicted differ also in po-
larity, leading to the following four cases: dark-to-bright (rightward increasing step

12 Vito Volterra (1860 – 1940). Italian mathematician.




Fig. 2.10 Different types of edges and receptive fields. In each frame, four receptive fields of
the simple type are shown. Black ellipses indicate inhibitory regions, white ellipses excitatory
regions. The two fields to the left have odd symmetry, the two to the right even symmetry. For
each symmetry type, the two possible polarities are also shown. The receptive field responsive
in each case is highlighted by a white circle. a., b.: Step edges. Only odd symmetric fields
of the appropriate polarity will respond. c., d.: Contour edges. Only even symmetric fields of
the appropriate polarity will respond.

edge), bright-to-dark (rightward decreasing step edge), bright-on-dark (incremental


contour), and dark-on-bright (decremental contour).
Also shown in Figure 2.10 are four receptive fields modeled as Gabor functions
which can be characterized as (i) increasing, odd (gs in Equation 2.35), (ii) decreas-
ing odd (−gs ), (iii) incremental even (gc ), and (iv) decremental even (−gc ). All four
profiles are shown for each edge and the one profile yielding the strongest response
is circled. All these receptive fields are oriented and linear, i.e. they model cortical
simple cells.
In contrast to simple cells, cortical complex cells respond to edges of all polarities and phases, as long as the edges have the preferred orientation (and scale); i.e., they behave invariantly with respect to polarity and phase. Polarity invariance is an intrinsically non-linear property, as can be seen from the following consideration:
Let I₊(x) be a bright-on-dark (“incremental”) contour edge as depicted in Figure 2.10c and the first column of Figure 2.11. Let the response of some linear receptive field to this stimulus be denoted by e₊. The decremental, dark-on-bright contour can then be written as I₋(x) = Io − I₊(x) where Io is a constant. Due to the assumed linearity, the response to this stimulus must be the difference between the response to the constant intensity Io and e₊. Since neurons will generally not respond to constant stimuli at all (∫∫ φ(x, y) dx dy ≈ 0), we obtain e₋ = −e₊, which can be satisfied only if the neuron does not respond to the edge at all. This proves that polarity-invariant neurons must be non-linear.
How, then, can we construct a model of a complex cell reacting equally well to
step and contour edges of either polarity? The simplest way to achieve this is to take
all four simple receptive fields shown in Figure 2.10, pass them through a rectifying
non-linearity (i.e. neglect negative outputs) and take the sum of the four input lines.

ecpx = f(I ⊗ gs) + f(I ⊗ (−gs)) + f(I ⊗ gc) + f(I ⊗ (−gc))

f(u) = { u² for u ≥ 0, 0 for u < 0 } (2.51)

where ecpx is the excitation of the complex neuron, gc and gs are the even and odd Gabor functions, and ⊗ denotes the correlation (Equation 2.9). The static non-linearity f could also be modeled as simple half-wave rectification, but the squaring gives better results. Instead of writing down the on- and off-Gabor fields ±gc and ±gs separately, we observe I ⊗ (−gs,c) = −(I ⊗ gs,c) and obtain

ecpx = (I ⊗ gs)² + (I ⊗ gc)². (2.52)

Equation 2.52 is called the energy model of the cortical complex cell (Adelson & Bergen, 1985); for a schematic depiction see Figure 2.12e. The term “energy” is not to be taken literally, but simply reflects an analogy from the theory of alternating current, where the sum of squared sine and cosine components is an electrical energy. It is also related to the notion of “signal power” introduced in the following chapter.
Although in Equation 2.52 only two receptive field functions are considered, the formulation of Equation 2.51 is biologically more plausible, since it takes into account the non-negativity of neuronal spike rates. In the final model, on- and off-profiles which individually are subject to a rectifying non-linearity are combined into a channel that behaves completely linearly.
Figure 2.11 shows the energy model in more detail. The top row shows the four
edge types (two phases, two polarities). These edges are meant to appear on top of
some constant background, such that negative values are no problem. As before, we
assume that no responses to constant inputs occur. The left column shows the recep-
tive field profiles of the four types of simple cells, ±gc (x), ±gs (x). Each bar plot in
the center shows the simple cell activity ±gs,c ⊗ Δ I, where the center bar refers to a
neuron exactly centered at the edge, while the flanking bars refer to receptive fields
somewhat beside the edge. The bottom row shows the activity of an energy neuron.
Note that the response is the same for all four edge types.
The removal of phase and polarity information does not depend on the use of Gabor functions but can be obtained with any so-called quadrature pair of receptive field functions. A general quadrature pair consists of an odd function fo and an even function fe, and satisfies the condition that the complex function fe + i fo, where i = √−1 is the imaginary unit, must be analytic, i.e. differentiable in the complex plane. Equivalently, the Hilbert transform must convert the even function into the odd one and vice versa. In fact, the odd/even Gabor functions satisfy this condition only approximately.
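The phase and polarity invariance of Equation 2.52 can be illustrated numerically. The following sketch assumes a one-dimensional Gabor quadrature pair and a grating stimulus; σ, w0 and the tested phases are arbitrary choices, and the energy output is (approximately) constant, including the polarity flip at phase π:

```python
import numpy as np

# Sketch of the energy model (Equation 2.52) with a Gabor quadrature pair.
x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]
sigma, w0 = 1.5, 2.0
gc = np.exp(-x**2 / (2 * sigma**2)) * np.cos(w0 * x)   # even (cosine) Gabor
gs = np.exp(-x**2 / (2 * sigma**2)) * np.sin(w0 * x)   # odd (sine) Gabor

for phase in (0.0, np.pi / 4, np.pi / 2, np.pi):       # pi = polarity reversal
    I = np.cos(w0 * x + phase)                         # grating stimulus
    ec = np.sum(I * gc) * dx                           # correlation at offset 0
    es = np.sum(I * gs) * dx
    print(f"phase {phase:4.2f}: energy = {ec**2 + es**2:.4f}")   # ~constant
```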

2.3.5 Summary: Receptive Fields in the Primary Visual Pathway


• Isotropic or rotationally symmetric receptive fields are found in retinal gan-
glion cells and the LGN. They are usually modeled as differences of Gaus-
sians. Isotropic fields vary mostly with respect to their polarity (on-center vs.
off-center) and size (scale) (Figure 2.12 a, b).
• Orientation specific receptive fields of the “simple” type are found throughout
the visual cortex. In addition to polarity and scale, they vary with respect to
orientation, phase (odd vs. even) and spatial frequency (Figure 2.12 c, d).


Fig. 2.11 Energy model of visual neurons of the “complex” type. The top row shows, from left to right, four local intensity profiles corresponding to a bright contour on dark ground, a dark contour on bright ground, a step edge from a dark to a bright image region, and a step edge from a bright to a darker image region. The leftmost column shows receptive field functions of four simple cells modeled as Gabor functions with even or odd phase (cosine or sine) and positive or negative polarity. The histograms above the bar show the responses of three such cells with receptive fields centered at x₋₁, xo, and x₁, respectively. The row below the bar shows a complex cell summing up the squared outputs of each of the simple cells. Note that the histograms for all four input profiles are identical, indicating the complex cell's invariance to edge polarity and phase.

• Cortical neurons of the “complex” type are invariant to phase and polarity, but
share with simple cells the specificities for orientation, spatial frequency, and
scale. The invariance requires a non-linearity which is modeled by the energy
model (Figure 2.12 e).


Fig. 2.12 Visual receptive fields. a, b. Isotropic (rotationally symmetric) types found in retina and LGN. a. On-center off-surround. b. Off-center on-surround. c, d. Oriented simple cells of the visual cortex. c. Even (cosine) Gabor function. d. Odd (sine) Gabor function. e. Cortical complex neuron responsive to “contrast energy”.

2.4 Motion Detection


2.4.1 Motion and Flicker
Visual motion is the most important stimulus for many neurons throughout the visual system. On the input side, motion can be defined as the coherent displacement of an image patch or an image feature over time, where the amount and direction of the displacement form the motion vector. Visual motion is thus characterized by two quantities, i.e., length (speed) and direction, or equivalently the x- and y-components of the motion vector. If we denote the local motion vector by v(x, y, t) = (vx(x, y, t), vy(x, y, t)), image change due to visual motion can be expressed by the equation

I(x, y, t + dt) = I(x − vx(x, y, t) dt, y − vy(x, y, t) dt, t) (2.53)

where v dt is the motion displacement in the time interval dt.
Note that not every change in an image is a motion. For example, if the light
in a room is switched on and off, all pixel intensities change, but there is no dis-
placement of image patches and therefore no motion. Likewise, the dynamic noise
pattern appearing on a poorly tuned television set is not a motion stimulus despite its
dynamic, i.e. ever changing structure. Image change which cannot be described as
image motion is called flicker. It is a scalar quantity without a direction. In analogy
to Equation 2.53, we write

I(x, y,t + dt) = f (x, y,t) dt + I(x, y,t), (2.54)

where f denotes the flicker.
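A toy sketch of the distinction, assuming a one-dimensional discrete “image” row: motion shows up as a displaced peak when comparing shifted copies (formalized as cross-correlation in Section 2.4.3), whereas flicker does not. Shift size and noise level are arbitrary.

```python
import numpy as np

# Sketch: motion (Equation 2.53) vs. flicker (Equation 2.54) for one image row.
rng = np.random.default_rng(2)
I = rng.normal(size=100)                      # image row at time t

moved = np.roll(I, 3)                         # motion: coherent shift by 3 pixels
flickered = I + 0.5 * rng.normal(size=100)    # flicker: pointwise change, no shift

best_lag = lambda a, b: np.argmax(np.correlate(a, b, mode="full")) - (b.size - 1)
print(best_lag(moved, I))       # displacement recovered: typically 3
print(best_lag(flickered, I))   # no displacement: typically 0
```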



By inspection of Equations 2.53 and 2.54, it is clear that the measurement of visual motion amounts to estimating two variables per pixel (vx and vy), whereas only one variable is to be measured in flicker (f). We will not pursue the role of flicker further but continue with visual motion.
The fact that visual motion is a vector quantity implies that the tuning curve of a
motion-selective neuron depends on two stimulus parameters (speed and direction),
not just on one. With sharp tuning, the neuron operates as a detector for a particular
motion, i.e., it signals whether or not its preferred motion is present. This type of
coding is known as labeled line coding; it is different from a (hypothetical) motion
estimator, where the output would be a continuous estimate of the two-dimensional
motion vector. Clearly, no such estimator can be realized by a single neuron.


Fig. 2.13 a. Simple motion detector consisting of two input lines of which one is delayed by
the time lag τ , and a threshold. The delay is shown as a temporal convolution with a displaced
δ -pulse. The stimulus is moving from left to right with a velocity v = Δ x/Δ t, leading to the
inputs I(t) and I(t − Δ t) at the two receptors. If the delay τ matches Δ t, the activations in both
lines will reach the threshold unit simultaneously. The threshold has to be adjusted such that
it is passed only by coincident inputs on both lines. The detector is specific for direction and
speed. b. In the correlation, or Reichardt detector, the delay is replaced by a temporal lowpass
filter and the threshold is replaced by a multiplication and subsequent averaging over time.
The latter two steps realize a cross-correlation between the direct and the low-passed input
line with offset zero (symbol ⊗). The complete Reichardt detector consists of two mirror
symmetric circuits of the type shown, whose outputs are subtracted to give signed motion
output.

2.4.2 Coincidence Detector


The most obvious approach to motion detection is to combine a thresholding op-
eration with a delay in one of its input lines (Figure 2.13a). If a signal, such as a
flash of light, appears first at the delayed input line and later at the non-delayed line,

and if the time of travel matches the built-in delay, both inputs will arrive simul-
taneously and the threshold may be passed. If, however, the stimulus moves in the
opposite direction or with the wrong speed, both inputs arrive at different times and
the threshold will not be passed. This principle has been suggested as a model of
motion selectivity in rabbit retinal ganglion cells (Barlow & Levick 1965).
In order to prevent the circuit from responding to constant white input on both lines, an appropriate preprocessing has to be added, such as a differentiation in time and perhaps a Difference-of-Gaussians operator in space.
Note that the delay operation is a linear, shift-invariant operation which can be expressed as a convolution. Its impulse response is a delayed impulse, δ(t − Δt). Using the definition of the δ-pulse, Equation 2.10, we may write

∫ I(t′) δ(t − Δt − t′) dt′ = I(t − Δt). (2.55)

In Figure 2.13a, the delay operation is illustrated by its impulse response.


The delay-threshold detector is tuned sharply to the velocity Δx/Δt, where Δx is the distance between the receptive fields of its input lines and Δt is the built-in delay. Its velocity tuning curve thus looks just like the impulse response of the delay process.
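A minimal sketch of this delay-and-threshold scheme in discrete time (receptor spacing dx, delay, and threshold are illustrative choices): only the matched speed v = Δx/Δt, here dx/delay = 1, produces a coincidence.

```python
import numpy as np

# Sketch of the delay-and-threshold coincidence detector of Figure 2.13a.
def responds(v, dx=4, delay=4, T=200):
    """True if a spot moving at speed v (pixels/step) passes the threshold."""
    t = np.arange(T)
    s1 = (t == 50).astype(float)                   # spot passes receptor 1 at t=50
    s2 = (t == 50 + round(dx / v)).astype(float)   # ...receptor 2, dx/v steps later
    delayed = np.roll(s1, delay)                   # delay line on receptor 1
    return bool(np.max(delayed + s2) >= 2.0)       # coincidence needed to fire

for v in (0.5, 1.0, 2.0, 4.0):                     # preferred speed: dx/delay = 1.0
    print(f"v = {v}: response = {responds(v)}")
```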

2.4.3 Correlation Detector


An alternative design is shown in Figure 2.13b, where the delay is replaced by a temporal low-pass filter and the threshold is replaced by a correlation (correlation- or Reichardt¹³-detector, Eichner et al. (2011)). At any instant in time, the output of the temporal low-pass is a mixture of its inputs at the preceding times, weighted by the temporal impulse response function h(t) depicted in Figure 2.13b. It thus acts like the superposition of a whole set of weighted delay lines. Therefore, the comparison operation in the subsequent neuron will yield a response not just to one velocity, but to many. As in the pure delay case, the tuning curve looks like the temporal impulse response function in the low-passed input line.
The comparison operation used in this case is somewhat more elaborate than simple thresholding. It is based on the comparison of the time-dependent signals during an extended period of time. To understand this, we have to briefly introduce the notion of auto- and cross-correlation of functions; see Figure 2.14. In statistics, sample correlation is defined for a set of paired data (xᵢ, yᵢ), i = 1, …, n, as

cor(x, y) := ∑ᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / √( ∑ᵢ₌₁ⁿ (xᵢ − x̄)² · ∑ᵢ₌₁ⁿ (yᵢ − ȳ)² )
           = ( (1/n) ∑ᵢ₌₁ⁿ xᵢyᵢ − x̄ȳ ) / √( var(x) var(y) ), (2.56)

where x̄ and ȳ are the average values of x and y, and var(x) and var(y) are the respective variances. In signal theory, it is customary to neglect the average values (i.e.,

13 Werner E. Reichardt (1924 – 1992). German biologist and physicist.




Fig. 2.14 Left: The auto-correlation of a function f for an offset τ is calculated by evaluating the function at all pairs of points with separation τ. For the resulting pairs (f(t), f(t + τ)) over all possible positions t, “correlation” is calculated by multiplying the values in each pair and summing the results. Since this procedure can be repeated for each offset τ, a so-called auto-correlation function results, which is denoted Φff(τ). A typical auto-correlation function is shown in the lower part of the picture. Right: The same procedure can be applied to two different functions, in which case the result is called cross-correlation, Φfg(τ).

assume x̄ = ȳ = 0) and omit the division by the variances.¹⁴ For a time-dependent signal f(t), we can then consider the correlation of a value occurring at time t with the value occurring at time t + τ, for some fixed “offset” τ. For various times t, the data pairs (f(t), f(t + τ)) form a set of paired variables for which correlation can be calculated. Since we have an infinite set of data pairs indexed by the variable t, the sum in Equation 2.56 is replaced by an integral and we obtain

Φff(τ) = ∫ f(t) f(t + τ) dt. (2.57)

In the theory of random processes, the replacement of a population average by


a temporal average requires a property of the random process called ergodicity.
We will assume here that ergodicity obtains. Auto-correlation as defined in Equa-
tion 2.57 can be calculated for each offset τ ; thus Φ f f becomes a function of the
offset. It is easy to see that Φ f f (τ ) takes its maximal value at τ = 0, and that
Φ f f (−τ ) = Φ f f (τ ).
The cross-correlation function is defined by the same logic, only that the two
values of the paired variables are now taken from different functions; see Fig. 2.14.
For two functions f , g, it is defined as

14 Without the normalization, the operation is actually more like a covariance, defined as
cov(x, y) = (1/n) ∑(xi − x̄)(yi − ȳ). In some texts, the auto- and cross-correlation functions are
therefore called auto- and cross-covariance, respectively. We will not adopt this terminology
here.


Φfg(τ) = ∫ f(t) g(t + τ) dt. (2.58)

Cross-correlation is a means to detect delays or shifts between two related functions. For example, if g(t) = f(t − Δt), i.e. g is a delayed copy of f, the cross-correlation function takes its maximum at τ = Δt, which can be used to estimate such delays from longer sequences of data.
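A short numerical sketch of this use of cross-correlation, assuming discrete signals and a circular shift (shift size and signal length are arbitrary):

```python
import numpy as np

# Sketch: the peak of the cross-correlation recovers the delay between
# two signals.
rng = np.random.default_rng(3)
f = rng.normal(size=500)
g = np.roll(f, 17)                               # g is f delayed by 17 samples

c = np.correlate(g, f, mode="full")              # Phi_fg over all offsets tau
print(np.argmax(c) - (f.size - 1))               # estimated delay: 17
```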
In the correlation-type motion detector, the output neuron multiplies its two inputs and accumulates this product over time, i.e.

e = ∫ ( ∫ h(t′) I(t − t′) dt′ ) I(t − Δt) dt (2.59)
  = (ΦII ∗ h)(Δt), (2.60)

where the inner integral in Equation 2.59 is the low-passed input (h ∗ I)(t). This is to say, the output of the motion detector is given by convolving the auto-correlation function of the input with the impulse response h and evaluating the result at Δt. If the image is a white noise pattern, i.e. a noise pattern where adjacent pixels are uncorrelated, its auto-correlation function is a δ-pulse, ΦII(τ) = δ(τ), and the tuning curve of the motion detector will be ρ(v) = h(Δx/v).
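A numerical sketch of one half of the detector (the full Reichardt detector would subtract the output of the mirror-symmetric circuit), assuming a first-order low-pass for h and a drifting sinusoidal grating; all parameter values are illustrative:

```python
import numpy as np

# Sketch of one half of the correlation detector (Figure 2.13b): first-order
# low-pass (time constant tau) on one input, multiplication with the direct
# input, temporal averaging (Equation 2.59).
def half_detector(v, k=2.0, dx=1.0, tau=0.5, T=4000, dt=0.01):
    t = np.arange(T) * dt
    I1 = np.sin(k * (0.0 - v * t))     # drifting grating I(x,t) = sin(k(x - v t))
    I2 = np.sin(k * (dx - v * t))      # same grating at the second receptor
    y = np.zeros_like(I1)              # low-passed copy of I1
    for n in range(1, T):
        y[n] = y[n - 1] + (dt / tau) * (I1[n] - y[n - 1])
    return np.mean(y * I2)             # multiply and average over time

for v in (-2.0, -0.5, 0.5, 2.0):
    print(f"v = {v:+.1f}: output = {half_detector(v):+.3f}")  # direction-dependent
```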

2.4.4 Motion as Orientation in Space-Time


The standard model of motion selectivity in the mammalian visual cortex (Adelson
& Bergen 1985) can be derived from the “bi-local” detectors shown in Figure 2.13a,
b by assuming that the detecting neuron receives not just two input lines, but many,
each with a specific delay, and that these inputs are summed up at the soma (see
Figure 2.15). The delays are adjusted such that for a particular speed of a stimulus
passing by the receptors, all inputs will reach the output neuron at the same time. As
before, this speed is the preferred motion speed for the neuron. Figure 2.15b shows
an alternative depiction of the same idea where the stimulus is now presented in a
space-time diagram with the spatial coordinate x as a horizontal axis and time as
the vertical axis. The diagram thus shows the temporal development of one image
line; the second spatial dimension has been left out for simplicity. In this diagram,
the stimulus “optimally” driving the neuron is a row of light spots lined up along
the line t = x/v. The same output would be generated by a continuously moving
stimulus, moving with the same speed, which in Figure 2.15b would show as a
bright bar covering the three spots. Velocity, in the space-time diagram, corresponds
to slope, or spatio-temporal orientation.
The space-time diagram of Figure 2.15b can also be interpreted as a spatio-temporal receptive field function as discussed already in Section 2.2.3. If we allow the input lines from each spatial position to have full-fledged temporal impulse responses, rather than just a delay element, a full spatio-temporal receptive field results, as is shown in Figure 2.15c. It depicts the spatio-temporal Gabor function defined by Equations 2.36 and 2.37, without the logarithmic compression of the time axis. As

Fig. 2.15 Motion selective receptive fields. a. A hypothetical neuron with three equally spaced input sites, each with a different temporal delay (rectangular boxes). b. (x,t)-diagram of a stimulus flashing successively at the three input positions of the neuron. With the timing and the delays shown in a., all activities arrive simultaneously at the unit. c. (x,t)-diagram of a spatio-temporal Gabor function (Equation 2.35) modeling the receptive field of a motion selective simple cell. The orientation in the space-time plot corresponds to the preferred velocity.

a receptive field function, it describes the response behavior of a simple cell tuned
for motion (rightwards), scale, spatial frequency, location, phase (cosinusoidal), and
polarity. Spatial orientation cannot be shown in the figure due to the restriction to
two dimensions, but is also included in the mathematical model.
For motion selective complex cells, the energy model (Section 2.3.4) is applied analogously. It is built from two simple cells, one with sinusoidal and one with cosinusoidal phase, but with identical values for all other parameters. The outputs are squared and summed up, as shown in Figure 2.12e above.

2.5 Suggested Reading


Books
De Valois, R. L. and De Valois, K. K. (1988). Spatial vision. Oxford Psychology
Series No. 14. Oxford University Press, Oxford, UK.
Frisby, J. and Stone, J. V. (2010). Seeing: The computational approach to biological
vision. The MIT Press, Cambridge, MA.
Mallot, H. A. (2000). Computational Vision. Information Processing in Perception
and Visual Behavior. The MIT Press, Cambridge, MA.
Marr, D. (1982). Vision. A Computational Investigation into the Human Represen-
tation and Processing of Visual Information. W. H. Freeman, San Francisco.
Rolls, E. T. and Deco, G. (2002). Computational Neuroscience of Vision. Oxford
University Press, Oxford, UK.

Original Papers
Adelson, E.H. and Bergen, J.R. (1985). Spatiotemporal energy models for the detection of motion. Journal of the Optical Society of America A 2:284 – 299.
Fundamental account of visual motion as spatio-temporal orientation. Derives
suitable spatio-temporal filters for motion detection and suggests biologically
plausible version constructed from spatio-temporally separable filters.
Itti, L. and Koch, C. (2000) A saliency-based search mechanism for overt and covert
shifts of visual attention. Vision Research 40:1489 – 1506
Saliency of visual field locations is defined using the overall output of a large
battery of visual filters corresponding to the so-called perceptual dimensions
orientation, granularity, motion, and color. The model is thus an application of
the receptive field theory explained in this chapter. It is still one of the standard
approaches to focal attention.
Mach, E. (1865) Über die Wirkung der räumlichen Vertheilung des Lichtreizes
auf die Netzhaut. Sitzungsberichte der mathematisch-naturwissenschaftlichen
Classe der kaiserlichen Akademie der Wissenschaften Wien 52/2, p.303 – 322.
An English translation (On the effect of the spatial distribution of the light stim-
ulus on the retina) appears in F. Ratliff, Mach Bands: Quantitative Studies on
Neural Networks in the Retina, Holden-Day, San Francisco, 1965.
This century-old paper is still recommended reading for its clear and clever ar-
gument as well as for its general discussion of psychophysics and the relation
between neural mechanisms and experienced perceptions. It introduces the ef-
fect now known as Mach bands.
Marr, D. and Hildreth, E. (1980). Theory of edge detection. Proceedings of the Royal
Society (London) B, 207:187 – 217.
This paper intuitively shows how center-surround mechanisms support early visual processes. At the same time it gives a comprehensive account of the Difference-of-Gaussians and Laplacian-of-Gaussian operators. Together with a series of similar treatments of other visual processing steps covered in D. Marr's book (see above), it belongs to the foundations of the field of computational vision.
Olshausen, B. and Field, D. (1996a). Emergence of simple–cell receptive field prop-
erties by learning a sparse code for natural images. Nature, 381:607 – 609.
This paper shows how receptive field organization in the visual cortex can be derived from optimality criteria based on statistical considerations such as sparseness of coding, redundancy reduction, etc.
Srinivasan, M. V., Laughlin, S. B., and Dubs, A. (1982) Predictive coding: a fresh
view of inhibition in the retina. Proceedings of the Royal Society London B,
216:427 – 459.
Center-surround receptive field organization is shown to be optimal in the sense
that redundancies in visual coding are removed. Diameters of center and sur-
round parts are derived from the average auto-correlation function of the visual
input.
Chapter 3
Fourier Analysis for Neuroscientists

Abstract. In this Chapter, we introduce a piece of mathematical theory that is of


importance in many different fields of theoretical neurobiology, and, indeed, for sci-
entific computing in general. It is included here not so much because it is a genuine
part of computational neuroscience, but because computational and systems neuro-
science make extensive use of it. It is closely related to systems theory as introduced
in the previous chapter but is also useful in the analysis of local field potentials,
EEGs or other brain scanning data, in the generation of psychophysical stimuli in
computational vision and of course in analyzing the auditory system. After some
instructive examples, the major results of Fourier theory will be addressed in two
steps:
• Sinusoidal inputs to linear, translation-invariant systems yield sinusoidal outputs,
differing from the input only in amplitude and phase but not in frequency or over-
all shape. Sinusoidals are therefore said to be the “eigen-functions” of linear shift
invariant systems. Responses to sinusoidal inputs or combinations thereof are
thus reduced to simple multiplications and phase shifts. This is the mathematical
reason for the prominent role of sinusoidals in scientific computing.
• The second idea of this chapter is that any continuous function (and also some
non-continuous functions) can be represented as linear combinations of sine and
cosine functions of various frequencies. Alternatively to the use of sine- and co-
sine functions, one may also use sinusoidals with a phase value for each fre-
quency, or complex exponentials from the theory of complex numbers.
Both ideas combine in the convolution theorem, stating that the convolution of two
functions can also be expressed as the simple product of the respective Fourier trans-
forms. This is also the reason why linear shift-invariant systems are often considered
“filters” removing some frequency components from a signal and passing others.



Fig. 3.1 a. Distribution of electric field strength in two primary colors (blue and red, dashed
lines) and in the resulting mixture (magenta, solid line). b. Spectra of the pure colors red and
blue. The spectrum of the mixture includes both lines.

3.1 Examples
3.1.1 Light Spectra
The notion of a spectrum in the physics of light is closely related to the ideas of the Fourier¹ transform. Light is an electromagnetic wave which can be described by the
electric and magnetic fields as a function of space and time. Consider a beam of
coherent light in direction x. At a given instant in time, the corresponding fields are
smooth functions of spatial position x. In pure, or “spectral” colors, both the electric
and the magnetic field strength oscillate according to a sinusoidal function of x, gen-
erally specified by its wavelength λ . Figure 3.1a shows the electric field distribution
of a blue (short wavelength, high frequency) and a red (long wavelength, low fre-
quency) light (dashed lines). If we superimpose both lights, we will experience the
color magenta, or purple. The electric field distribution is the sum of the according
distributions of the two basic colors red and blue (Figure 3.1a, solid line). Note that
this distribution is no longer a sinusoidal; it need not be periodic at all. Still, it can
be thought of as being composed of two sinusoidal components with different fre-
quencies. Figure 3.1b shows the spectral representation of the same lights. Here, for
each wavelength λ the amplitude of the sinusoidal with this particular wavelength
is shown. For the primaries red and blue, these are line-spectra, i.e. the spectrum is
zero except for the wavelengths λr = 650 nm and λb = 370 nm. The spectrum of the
purple light is the sum of the two line spectra.

3.1.2 Acoustics
In acoustics, time-dependent sound pressure is measured by a microphone and can
be visualized with an oscilloscope. For example, when playing the tone e’ on a treble
recorder, the oscilloscope will show a fairly clean sinusoidal wave with frequency
1 Jean Baptiste Joseph Fourier (1768 – 1830). French mathematician and physicist.


Fig. 3.2 Time- and frequency signals in music. a. Sound pressure of a pure tone at ωo = 330 Hz, i.e. e' (tonic), the minor third g' (ω/ωo = 6/5), the major third g♯' (ω/ωo = 5/4), and the fifth b' (ω/ωo = 3/2). The frequency ratios correspond to the just, or pure, intonation. The phases of the signals have been randomized. b. Sound pressure for the e-minor and e-major triads e'-g'-b' and e'-g♯'-b'. c. Musical notation for the same two triads. The staff and clef define a reference frame for pitch which is akin to a logarithmic frequency scale.

330 Hz. Figure 3.2a shows the time-dependent signal for the tone e' together with the signals for the tones g', g♯', and b', all of which are sinusoidal waves differing in frequency. If these tones are played together, the sound pressure signals add up to the patterns shown in Figure 3.2b. In music, these patterns correspond to two well-known chords, the e-minor and e-major triads. When such chords are played and reach the ear of a listener, the cochlea of the inner ear will decompose the complex time signals of Figure 3.2b into the individual tones they are composed of. Mathematically, this operation corresponds to a Fourier decomposition of the

compound signal resulting in an acoustic spectrum with peaks at the respective fre-
quencies. The musical notation shown in Figure 3.2c represents this spectrum by
the individual note symbols.
When playing the same tone e’ on a piano, the microphone signal will again be a
periodic function with frequency 330 Hz, but the shape of the wave will be less sinu-
soidal. Representing the sound pressure function as a frequency spectrum will now
result in a peak at 330 Hz plus additional peaks at some or all integer multiples of
330 Hz. These multiples are known as harmonics. The pattern of harmonics gener-
ates the “timbre” of the sound, i.e. the differences between productions of the same
tone by various musical instruments or the voices of different singers. Complex patterns of harmonics known as “formants” make the differences between vowels such as an /a/ or an /ee/ sung with the same pitch. The pitch of the tone corresponds to its fundamental frequency, of which the harmonics are multiples.
Complex acoustical events such as uttered speech or bird song are not easily
decomposed into tones, but may be represented as spectrograms (or sonograms).
Consider first a tune, i.e. a sequence of tones over time. In the spectrogram, the
signal will be decomposed into temporal sections (with some overlap defined by
a window function). Next, the Fourier transform within each temporal window is
computed and the result is plotted as a column of gray-values in a coordinate system
spanned by time (the center time of each window) and frequency. As a result, a line
will appear in the sonogram going up for high pitches and down as pitch is lowered.
In speech or bird song, many frequencies will be present simultaneously, resulting
in more complicated sonograms.


Fig. 3.3 Sinusoidal grating of the form I(x, y) = Ī + 0.5 ΔI sin(2πx/T), where Ī = 0.5(Imax + Imin). Such gratings with various period lengths T and contrasts are used for measuring the acuity of vision.

3.1.3 Vision
The representation of signals by sinusoidals and frequencies is rather intuitive in
the cases of colors and tones. In images, i.e. two-dimensional distributions of inten-
sity values, frequency representations seem to be much less intuitive at first glance.
However, properties such as resolution, acuity, or granularity of an image are closely
related to a frequency variable. Indeed, the well-known JPEG-encoding of images
rests on the discrete cosine transform (DCT), a close relative of the Fourier trans-
form. In this representation, homogeneous image regions with little intensity varia-
tion are represented by a coarse grating and thus need less storage space than image
regions with fine-grain contrasts.
Figure 3.3 shows a sinusoidal intensity variation over an image coordinate x.
Gratings are characterized by a wavelength, or alternatively by a (spatial) frequency.
Since image intensities cannot be negative, gratings will always be offset from zero
on the intensity axis. Instead of characterizing mean and amplitude, the strength of
modulation is often defined as the so-called Michelson² contrast,
contrast := (Imax − Imin) / (Imax + Imin). (3.1)
In the two-dimensional case, the sinusoidals become plane waves, i.e. functions
with a sinusoidal modulation in one direction and zero variation in the orthogonal
direction (cf. Figure 2.7a). Images may then be decomposed into two-dimensional
gratings of various frequencies. Table 3.1 shows icons of two-dimensional gratings
with various combinations of spatial frequencies. Superposition of gratings amounts
to a pixelwise summation of intensities. Fourier theory posits that any image can be
generated by superimposing gratings with variable amplitude and phase (i.e., posi-
tional shift). Since the set of gratings is a two-dimensional manifold, the spectrum
becomes a two-dimensional function specifying the amplitude for each component
grating.

3.1.4 Magnetic Resonance Tomography


In nuclear magnetic resonance imaging, a slice image of a volume such as the head
of a patient is generated based on the local density of hydrogen atomic nuclei (pro-
tons). The basic underlying effect is nuclear magnetic resonance (NMR). Protons
placed in a strong static magnetic field (e.g., Ho = 1.5 Tesla) can be “activated”
by applying a second, high frequency alternating electromagnetic field. Once the
protons are activated, they “resonate”, i.e. they emit an electromagnetic wave with a
specific frequency. The frequencies for both activation and response are proportional
to the basic field strength Ho and the gyro-magnetic ratio of the atom considered.
In the imaging process, a slice is defined in the volume by adding an axial gradi-
ent field G to the basic field Ho . If z denotes the axial coordinate, the field strength
of the static field takes the form Ho + Gz. If an activating frequency is now applied,
2 Albert A. Michelson (1852 – 1931). American physicist. Nobel Prize in Physics 1907.

Table 3.1 Two-dimensional gratings of the form I(x, y) = sin(2π (ωx x + ωy y)) for some in-
teger values of ωx and ωy .

[Table cells show grating icons for ωx ∈ {−1, 0, 1, 2, 3} (columns) and ωy ∈ {−1, 0, 1, 2, 3} (rows).]

only the protons at a particular z-location will satisfy the resonance condition. These
protons fill a plane perpendicular to the axial direction. After the activation, another
DC gradient field is applied, say along the x-coordinate. This field component is
called a read-out gradient. The resonance signals emitted by the activated protons
have frequencies proportional to the total locally “perceived” field strength. Signals
emitted from voxels with different x-coordinate will therefore have different fre-
quencies. Since all signals are overlayed in the recording, the different frequency
components will have to be isolated in order to find the contributions from voxels at
a particular x-position. This is done with the Fourier transform. The remaining prob-
lem, then, is that all voxels with a given x-coordinate, corresponding to one Fourier
component of the signal, form a line in the slice, extending along the y-coordinate.
To localize signals also in the y-direction, measurements with other read-out gradi-
ents must be performed and combined. Also in this step, the Fourier transform can
be useful.

3.2 Why Are Sinusoidals Special?


The examples given above make it clear that sinusoidal functions are used in many
contexts, but the mathematical reason for this needs further explanation. In this sec-
tion, we will present an important relation between linear, shift-invariant systems
and sinusoidals. A system is considered a “mapping”, assigning an output function
to each input function. Extending the concept of a “functional” from Equation 2.38
we write
E = Φ (I) (3.2)
where I and E are functions such as an input image and an activity distribution on
a layer of neurons; the arguments of these functions are omitted. Mappings such
as Φ , which assign functions to functions, are also called operators. An important
characteristic of the operator Φ is its set of eigenfunctions,³ i.e. functions satisfying
the equation
Φ( f ) = λ f (3.3)
where λ is a number called the eigenvalue associated with the eigenfunction f .
Before we give a formal account of the problem, we note that the eigenfunc-
tions of convolution (i.e., linear shift-invariant) operators are likely to be periodic
functions, as is illustrated in Figure 3.4.

Fig. 3.4 Convolution of a periodic function. The periodic function is shown in the background. At three locations, the local neighborhood for the convolution is symbolized by a circle. Clearly, the images within the circles are identical; therefore, the result of convolution at all three positions must also be identical. It follows that the convolution of the periodic function with an arbitrary kernel yields again a periodic function. This result is an immediate consequence of the translation-invariance of convolution.

3.2.1 The Eigenfunctions of Convolution: Real Notation


Let f (x) be a convolution kernel. We write the convolution operation applied to a
sine function with frequency ω as ( f ∗ sinω ) and note that this is a function (the
output function of the system) which can be evaluated at variable values x. We may
then write the convolution with the sine function sin(ω x) as

3 The term “eigenfunction” is based on the German word “eigen” which means “own”. It expresses
the fact that eigenfunctions and eigenvalues characterize the operator.


(f ∗ sinω)(x) = ∫ f(x′) sin(ω(x − x′)) dx′ (3.4)

where x′ is an auxiliary variable which cancels out as the integral is computed (cf. Equation 2.16). Although in Equation 3.4 we use the spatial variable x, the argument works just as well in the temporal domain if causality is taken care of by setting f(t′) = 0 for t′ < 0.

It is important for the following argument that the difference x − x′ appears in the sine term, even though the integral remains the same if we exchange the roles of f and the sine (commutativity of convolution).
Applying the addition theorem sin(α − β) = sin α cos β − cos α sin β, we obtain:

(f ∗ sinω)(x) = ∫ f(x′) (sin ωx cos ωx′ − cos ωx sin ωx′) dx′. (3.5)

We may now split up the integral in two (distributive law) and move the terms not depending on x′ out of the integrals, which are taken over x′:

(f ∗ sinω)(x) = sin ωx ∫ f(x′) cos ωx′ dx′ − cos ωx ∫ f(x′) sin ωx′ dx′
             = f̃c sin ωx − f̃s cos ωx. (3.6)

After moving the x-dependent factors out of the integrals, the remaining integrals evaluate to constants, for which we introduce the names

f̃c := ∫ f(x) cos ωx dx (3.7)

f̃s := ∫ f(x) sin ωx dx. (3.8)

Thus, we have shown that the convolution of a sine with frequency ω with any
mask is a weighted sum of a sine and a cosine with the same frequency ω . A linear,
translation-invariant system cannot change the frequency of a signal.
Equation 3.6 is not yet the full answer to the eigenfunction problem, since the
response to a sine is a weighted sum of a sine and a cosine. We can get one step
closer if we observe that such sums of a sine and a cosine can be written as a sine
with a phase shift. This is in fact a special case of the addition theorem stated above.
We introduce the new variables A and φ , called amplitude and phase:

A := √(f̃c² + f̃s²), φ := tan⁻¹(f̃s/f̃c). (3.9)

A geometrical interpretation of the quantities f˜c , f˜s , A, and φ is given in Figure 3.5.
With these variables, we obtain

(f ∗ sinω)(x) = √(f̃c² + f̃s²) (cos φ sin ωx − sin φ cos ωx) (3.10)
             = A sin(ωx − φ). (3.11)

Fig. 3.5 Geometrical interpretation of the relation between the quantities f̃s, f̃c and A, φ: the point (f̃c, f̃s) in the plane has polar coordinates (A, φ).

Equation 3.11 is as close as we can get to the sought eigenfunction equation:


when passed through a linear translation-invariant system, sinusoidals are attenu-
ated (|A| < 1) or amplified (|A| > 1, only for energy consuming systems) and shifted
in phase. Both amplitude modulation and phase shift depend on the frequency of the
sinusoidal which cannot be changed by the linear translation-invariant system. Both
effects can be formally combined in one factor if complex number theory is used.
We will explain the complex notation in the next section.
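Equations 3.6–3.11 can be checked numerically. The following sketch assumes an arbitrary asymmetric kernel on a discrete grid: the convolution of sin(ωx) with the kernel coincides with A sin(ωx − φ), where A and φ are computed from Equations 3.7–3.9.

```python
import numpy as np

# Sketch: discrete check of Equations 3.6-3.11 for an arbitrary asymmetric
# kernel f.
x = np.linspace(-30, 30, 6001)
dx = x[1] - x[0]
omega = 1.3
f = np.where(np.abs(x) < 2, np.exp(-x), 0.0)    # kernel supported on |x| < 2

fc = np.sum(f * np.cos(omega * x)) * dx         # Equation 3.7
fs = np.sum(f * np.sin(omega * x)) * dx         # Equation 3.8
A, phi = np.hypot(fc, fs), np.arctan2(fs, fc)   # Equation 3.9

out = np.convolve(np.sin(omega * x), f, mode="same") * dx
predicted = A * np.sin(omega * x - phi)
inner = slice(1000, 5000)                       # avoid boundary effects
print(np.allclose(out[inner], predicted[inner]))   # True: same frequency, shifted
```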


Fig. 3.6 Complex numbers. a. The complex plane is spanned by the real and imaginary axes with units 1 and i = √−1. A complex number can be expressed through its real and imaginary parts (a, b), or by its “modulus” |z| and “argument”, or phase, arg z. b. Two complex numbers z₁, z₂ are summed by adding their respective real and imaginary parts. Multiplication can be thought of as a rotation and scaling, i.e. it adds the arguments and multiplies the moduli. c. The complex exponential with purely imaginary argument iφ describes a unit circle in the complex plane. d. The real and imaginary parts of the complex exponential are sinusoidal functions of φ (Euler's formula).

3.2.2 Complex Numbers


Complex numbers arise from solutions of algebraic equations. For example, the equation x² − 1 = 0 has the solutions x = ±1. In the set of real numbers, however, the equation x² + 1 = 0 has no solution. Formally, one may say that the “number” √−1 solves the above equation, even though √−1 is clearly not a real number. We introduce the notation

i = √−1 (3.12)

and call i the imaginary unit. For any pair of real numbers (a, b), we call

z := a + ib (3.13)

a complex number with real part a and imaginary part b (cf. Figure 3.6a). It can be shown that every algebraic equation (i.e. equation of the form ∑ₖ₌₀ⁿ aₖxᵏ = 0) has exactly n solutions in the set of complex numbers, where multiple solutions are counted accordingly. In the set of real numbers, it may have anything between 0 and n solutions.
n solutions.
We consider a few basic properties of complex numbers (see Figure 3.6). For a complex number z = a + ib, the real number

|z| := √(a² + b²) = √((a + ib)(a − ib)) (3.14)

is called the absolute value or modulus of z. The number z* := a − ib is called its complex conjugate. With this notation, we may write Equation 3.14 as |z|² = zz*.
The counterclockwise angle between the real axis and the “vector” z is called the argument, or phase, of z:

arg z := { tan⁻¹(b/a) for a > 0; π + tan⁻¹(b/a) for a < 0, b ≥ 0; −π + tan⁻¹(b/a) for a < 0, b < 0 }. (3.15)

For a = b = 0, the argument function is undefined. Strictly speaking, Equation 3.15 gives only the so-called principal value of the arg-function, to which integer multiples of 2π may freely be added to get the full arg function. We will not pursue this issue further here.
The complex numbers form a plane which is spanned by the real axis with unit 1 and the imaginary axis with unit i. Calculations on the set of complex numbers are defined in a straightforward way (Figure 3.6b). For example, addition of two complex numbers is accomplished by adding the real and imaginary parts. Here, complex numbers behave like two-dimensional vectors. They differ from 2D vectors by their multiplication rule,

(a₁ + ib₁)(a₂ + ib₂) := a₁a₂ + i(a₂b₁) + i(a₁b₂) + i²(b₁b₂) (3.16)
                     = (a₁a₂ − b₁b₂) + i(a₂b₁ + a₁b₂), (3.17)

with real part a₁a₂ − b₁b₂ and imaginary part a₂b₁ + a₁b₂.

With some computation, it can be shown that arg(z₁z₂) = arg z₁ + arg z₂ and |z₁z₂| = |z₁||z₂| (cf. Figure 3.6b). This is to say, multiplication with a number z amounts to a rotation by arg z and a scaling by |z|. This latter result, i.e. expressing multiplication as an addition of the arg-values and a multiplication of the moduli, is reminiscent of a property of the exponential function, i.e. eˣeʸ = eˣ⁺ʸ. Indeed, a complex exponential function can be defined based on the famous Euler⁴ formula:

exp{iϕ} = cos ϕ + i sin ϕ (3.18)

exp{a + ib} = eᵃ cos b + i eᵃ sin b. (3.19)

Two illustrations of Euler’s formula are given in Figure 3.6c, d. Figure 3.6c is called
the polar (or “phasor”) representation of a complex number, i.e. its representation
by modulus and phase:
z = |z| exp{i arg z} (3.20)
Figure 3.6d illustrates the relation of the complex exponentials to sinusoidals. As the
phasor rotates about the origin in the complex plane (i.e. as the phase φ increases),
the real and imaginary parts of the complex number describe a cosine and a sine
wave.
As a final remark on complex numbers, we note two inverted versions of Euler's formula that allow us to go back from the complex to the real notation:

cos ϕ = (1/2)(exp{iϕ} + exp{−iϕ}) (3.21)

sin ϕ = (1/2i)(exp{iϕ} − exp{−iϕ}). (3.22)

3.2.3 The Eigenfunctions of Convolution: Complex Notation


With the complex notation, we gain two things: first, we can calculate convolutions
with complex exponentials without applying the addition formulae for trigonomet-
ric functions, thus simplifying the calculations considerably. Second, we can com-
bine the amplification and phase shift into one complex number, thus obtaining a
true eigenfunction. For the calculation, we insert the complex exponential function
exp{iω x} into the convolution equation:

4 Leonhard Euler (1707 – 1783). Swiss mathematician.



(f ∗ expω)(x) = ∫ f(x′) exp{iω(x − x′)} dx′ (3.23)

= exp{iωx} ∫ f(x′) exp{−iωx′} dx′, (3.24)

where the remaining integral no longer depends on x and is abbreviated f̃(ω).

The eigenvalue associated with the eigenfunction exp{iωx} is, then, f̃(ω). The values f̃c and f̃s from Section 3.2.1 are, up to sign, the real and imaginary parts of f̃:

f̃ = f̃c − i f̃s = A e^{−iφ}, (3.25)

where A = |f̃|.
The function f̃(ω) describes, for each frequency ω, the action that the system exerts on an input signal exp{iωx}. In the output, the frequency itself is not changed, but the amplitude of each frequency component is multiplied by |f̃(ω)| and its phase is shifted by arg f̃(ω). The function f̃(ω) is therefore called the modulation transfer function (MTF) of the convolution system. As we shall see later, its relation to the point-spread function f in Equation 3.24 is already the definition of the Fourier transform.
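The eigenfunction property can be checked numerically on sampled signals. In the following sketch (Python/NumPy), the kernel f and the frequency ω are arbitrary example choices; away from the boundaries of the sampled domain, convolving exp{iωx} with f indeed reproduces exp{iωx} scaled by the complex number f̃(ω):

import numpy as np

# Sketch: complex exponentials are eigenfunctions of convolution (Eqs. 3.23-3.24).
x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]
f = np.exp(-x**2)                        # example kernel (an arbitrary choice)
omega = 2.0                              # example frequency
wave = np.exp(1j * omega * x)

conv = np.convolve(wave, f, mode="same") * dx           # (f * exp_omega)(x)
f_tilde = np.sum(f * np.exp(-1j * omega * x)) * dx      # eigenvalue f~(omega)

# Away from the boundaries, conv should equal f_tilde * wave
interior = slice(500, 1500)
print(np.abs(conv[interior] - f_tilde * wave[interior]).max())  # ~ 0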

3.2.4 Gaussian Convolution Kernels


As an example, consider the convolution with a Gaussian. In the two-dimensional
case, this example describes the projection of a slide with a de-focused slide projec-
tor. The point-spread function for this system is the image of a small light spot, i.e.
a blurring “disk”. It is mathematically described as a Gaussian where the width σ
describes the amount of blur,

$$f(x) = \exp\left\{-\frac{x^2}{2\sigma^2}\right\}. \qquad (3.26)$$
The modulation transfer function f˜ for this system is:

$$\tilde{f}(\omega) = \int \exp\left\{-\frac{x^2}{2\sigma^2}\right\} \exp\{-i\omega x\}\,dx. \qquad (3.27)$$
Next, we collect all exponents in one exponential function and add the term
−σ 4 ω 2 + σ 4 ω 2 = 0 in the exponent. We can then move a term not depending on x
outside the integral and are left inside with a completed square expression to which
the binomial rule can be applied:

$$\tilde{f}(\omega) = \int \exp\left\{-\frac{1}{2\sigma^2}\left(x^2 - 2i\sigma^2\omega x - \sigma^4\omega^2 + \sigma^4\omega^2\right)\right\} dx \qquad (3.28)$$
$$= \exp\left\{-\frac{1}{2}\sigma^2\omega^2\right\} \int \exp\left\{-\frac{1}{2\sigma^2}\left(x - i\sigma^2\omega\right)^2\right\} dx. \qquad (3.29)$$


Next, we substitute y = x − iσ²ω in the integral and note that ∫ exp{−y²} dy = √π even for complex arguments⁵. We thus obtain

$$\tilde{f}(\omega) = \exp\left\{-\frac{1}{2}\sigma^2\omega^2\right\} \int \exp\left\{-\frac{y^2}{2\sigma^2}\right\} dy \qquad (3.30)$$
$$= \sqrt{2\pi}\,\sigma\,\exp\left\{-\frac{1}{2}\sigma^2\omega^2\right\}. \qquad (3.31)$$
This is a Gaussian with width 1/σ . This result is also known as “uncertainty rela-
tion”: the product of the width values of the point spread function and the MTF, σ
and 1/σ , is a constant.
We now use Euler’s formula in the form of Equation 3.21 to write the result in
real notation. Note that this is the standard way to interpret complex number results:
$$(f * \cos_\omega)(x) = \frac{1}{2}\left[ (f * \exp_\omega)(x) + (f * \exp_{-\omega})(x) \right] \qquad (3.32)$$
$$= \frac{1}{2}\left( \tilde{f}(\omega)\,e^{i\omega x} + \tilde{f}(-\omega)\,e^{-i\omega x} \right) \qquad (3.33)$$
$$= \exp\left\{-\frac{1}{2}\sigma^2\omega^2\right\} \cos\omega x. \qquad (3.34)$$

(In the last step, the constant factor √(2π)σ from Equation 3.31 has been dropped; this amounts to a normalization of the kernel.)
The latter equality is due to the fact that the MTF is real and symmetric. Thus,
convolution with a Gaussian does not change the phase of the signal. This result is
true for all real, symmetric MTFs. For the sine-function, we obtain analogously
$$(f * \sin_\omega)(x) = \exp\left\{-\frac{1}{2}\sigma^2\omega^2\right\} \sin\omega x. \qquad (3.35)$$
Equations 3.31 to 3.35 show three things. First, the value of f˜(ω ) for a Gaussian
kernel is real and symmetric. Therefore, convolving with a Gaussian does not in-
volve a phase shift. In our blurred projection example, this means that the blur leads
to a symmetric smear of the point input, but not to a displacement. Second, the
amplification factor decreases as the frequency ω of the filtered pattern increases.
That is to say, the system will leave low spatial frequencies (coarse pattern) largely
unchanged, but will remove high spatial frequencies (fine detail). This behavior is
called low-pass. Of course, this behavior is very intuitive for the blurred projection
example given above. If spatial gratings of the type shown in Table 3.1 are used,
blur will leave the coarse pattern unchanged, but wash out the fine pattern. Third,
the width of the MTF, i.e. the “speed” with which f˜(ω ) decreases for higher fre-
quencies, depends on the width of the filter, σ. The wider the filter mask, the faster the amplification factor decreases, i.e. the more pronounced the low-pass property.
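The following sketch (Python/NumPy; σ = 0.2 and the test frequencies are arbitrary choices) compares the numerically computed MTF of a Gaussian kernel with the closed form of Equation 3.31 and shows the low-pass behavior:

import numpy as np

# Sketch: the MTF of a Gaussian of width sigma is a Gaussian of width
# 1/sigma (Eq. 3.31) -- the "uncertainty relation".
sigma = 0.2                              # assumed example width
x = np.linspace(-5, 5, 4001)
dx = x[1] - x[0]
f = np.exp(-x**2 / (2 * sigma**2))

for omega in (1.0, 5.0, 20.0):           # arbitrary test frequencies
    f_tilde = np.sum(f * np.exp(-1j * omega * x)) * dx
    analytic = np.sqrt(2 * np.pi) * sigma * np.exp(-sigma**2 * omega**2 / 2)
    print(omega, abs(f_tilde), analytic)
# High frequencies are attenuated most strongly: low-pass behavior.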

5 This result rests on the path independence of complex line integrals.



3.3 Fourier Decomposition: Basic Theory


Sinewave gratings are rare patterns in the real world. The relevance of the above
theory therefore rests strongly on the fact that most functions (and virtually all natu-
rally occurring signals) can be expressed as a linear superposition of sine and cosine
waves with characteristic weight functions. We will not prove this result rigorously
here but rather give an extended motivation. For a deeper mathematical treatment,
see the texts cited at the end of the chapter.


Fig. 3.7 Approximation of the “box function” (Equation 3.36 with ω = 1) by a Fourier sine series (Equation 3.39). The box function is shown as a heavy gray line. In each panel, the two thin lines show the current approximation g_k(x) and the next correction term b_{k+2} sin((k+2)νx) for k = 1, 3, 5, 7. For further explanations see text.

Table 3.2 Wavelength, frequency, and wave number of typical functions

                           sin x      sin ωx      sin 2πx    sin 2πωx
wavelength   T             2π         2π/ω        1          1/ω
frequency    ω = 1/T       1/(2π)     ω/(2π)      1          ω
wave-number  k = 2π/T      1          ω           2π         2πω

3.3.1 Periodic Functions


We consider first functions with the property f (x + T ) = f (x) for some T . An exam-
ple is the sine wave with T = 2π . Such functions are called periodic with wavelength
(or period) T . The inverse of the wavelength T is called the frequency, ω = 1/T .
Sometimes, we will also use the “wave number” k = 2π /T to characterize a periodic
function. The relations are summarized in Table 3.2.
As an example, consider the “box-function”,

$$b(x) := \begin{cases} 1 & \text{if } \operatorname{mod}(x, T) < T/2 \\ 0 & \text{else} \end{cases}. \qquad (3.36)$$

Figure 3.7 shows this box function (heavy gray line) together with two sinusoidals.
The first one has the same frequency as the box function itself; we call this frequency
the fundamental frequency ω f = 1/T . In the sequel, we will use the “wave number”
ν = 2π /T for the sake of simplicity. The sinusoidal thus has the form

g1 (x) = 1/2 + b1 sin(ν x) (3.37)

where the amplitude b1 is chosen such that the quadratic difference between the box function and the approximation is minimized. The factor a0 = 1/2 is called the DC, or direct current part; it is the average of the box function, a0 = (1/T) ∫₀ᵀ b(x) dx. If
or direct current part; it is the average of the box function, ao = 1/T oT b(x)dx. If
we consider the difference between b(x) and g1 (x), we note that it changes sign
three times as often as the fitting sine function. We can therefore arrive at a better
approximation of the box function by adding a second sine with frequency 3ν; it appears in the lower part of the figure. (The fact that the frequency 2ν does not con-
tribute to better approximation is an accidental property of the box function.) Fig-
ure 3.7b shows the least square approximation of the box function by the function
g3 (x) = g1 (x) + b3 sin(3ν x). The approximation is better, but again deviates from
the desired shape by an oscillatory function, this time with frequency 5ν . We can
repeat this procedure of stepwise improvements and eventually obtain

$$g_n(x) = a_0 + b_1 \sin \nu x + b_3 \sin 3\nu x + \ldots + b_n \sin n\nu x \qquad (3.38)$$
$$= a_0 + \sum_{k=1}^{n} b_k \sin k\nu x. \qquad (3.39)$$

The coefficients bk can be shown to take the values

$$b_k = \begin{cases} \dfrac{2}{k\pi} & \text{for } k \in \{1, 3, 5, \ldots\} \\ 0 & \text{for } k \in \{2, 4, 6, \ldots\} \end{cases}; \qquad (3.40)$$

a procedure for determining the coefficients for arbitrary functions will be presented below.
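A minimal sketch of this construction (Python/NumPy; T = 1 and the truncation at k = 39 are arbitrary choices) sums the series of Equation 3.39 with the coefficients of Equation 3.40 and compares it to the box function:

import numpy as np

# Sketch of the box-function approximation (Eqs. 3.36-3.40) with T = 1
T = 1.0
nu = 2 * np.pi / T                       # wave number of the fundamental
x = np.linspace(0, 2, 1000)
box = (np.mod(x, T) < T / 2).astype(float)

g = np.full_like(x, 0.5)                 # DC part a_0 = 1/2
for k in range(1, 40, 2):                # odd harmonics only (Eq. 3.40)
    g += 2 / (k * np.pi) * np.sin(k * nu * x)

print(np.mean((g - box)**2))             # small mean squared error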
We call series of the type shown in Equation 3.39 (including also cosine terms) “trigonometric polynomials” or Fourier series (see also Equation 3.46 below).
Figure 3.7 shows the first steps of that series, i.e. the functions g1 (x)

(Figure 3.7a) through g7(x) (Figure 3.7d). For each x satisfying mod(x, T) ≠ 0.5 and mod(x, T) ≠ 1, i.e. for each x where b(x) is continuous, the series converges⁶ towards the true functional value:

$$\lim_{n \to \infty} g_n(x) = b(x). \qquad (3.41)$$

Figure 3.8 shows a more general case where the sinusoidals needed to reconstruct
the signal have different phases. This can be seen by checking the value at x = 1 of
the correction term (lower sinusoid in each panel); while this is zero for all frequen-
cies in Figure 3.7, the value now changes from panel to panel. In the equation, the
phase shift is accounted for by adding a sine and a cosine term for each frequency,
each with its own coefficient ak and bk , respectively:
$$g_n(x) = a_0 + \sum_{k=1}^{n} a_k \cos k\nu x + \sum_{k=1}^{n} b_k \sin k\nu x. \qquad (3.42)$$

3.3.2 The Convolution Theorem; Low-Pass and High-Pass


In Section 3.2.3, we have shown that the convolution of a sine or cosine function
with some kernel amounts to multiplying the sine or cosine with a corresponding amplification factor and introducing a phase shift. This result generalizes nicely to
Fourier series. Since convolution is a linear operation, we may convolve a Fourier
series with a given kernel function by convolving each sine and cosine in the series
individually with that same kernel and adding up the results afterwards. Since the
frequencies of the sines and cosines do not change, this amounts to a multiplication
of the corresponding coefficients with some frequency-dependent factor.
As an example, we consider the convolution of the box function (Equation 3.36)
with a Gaussian (Equation 3.26). Recall that we already discussed this convolution
as modeling a de-focused slide projector taking a stripe pattern as its input and
producing a blurred image of the stripes. We write the box function as its Fourier
series

$$b(x) = \frac{1}{2} + \sum_{k=1}^{\infty} b_k \sin k\nu x. \qquad (3.43)$$
Using the results from Equations 3.34 and 3.35 together with the linearity of convolution, we obtain

$$(b * f)(x) = \frac{1}{2}(\cos_0 * f)(x) + \sum_k b_k (\sin_{k\nu} * f)(x) \qquad (3.44)$$
$$= \frac{1}{2} + \sum_k b_k \exp\left\{-\frac{1}{2}\sigma^2 (k\nu)^2\right\} \sin k\nu x. \qquad (3.45)$$
6 For the box function, convergence is pointwise, but not uniform. In the vicinity of the discontinuities of the box function, the oscillations are not damped, but move closer and closer to the edge while the overshoot remains constant. This effect is known as the Gibbs phenomenon. It occurs only for discontinuous functions, which are of little relevance for our purposes.


Fig. 3.8 Approximation of an arbitrary periodic function by a Fourier series. In each panel, repetitions of the function are shown as a heavy gray line. The thin lines show the current approximation g_k(x) and the next correction term a_{k+1} cos((k+1)νx) + b_{k+1} sin((k+1)νx) for k = 1 ... 8. For further explanations see text.


Fig. 3.9 Illustration of the convolution theorem. a. Box function (Equation 3.36 with ω = 1). b. Coefficients of the corresponding Fourier series (Equation 3.39). c. Low-pass filtered version of the box function, obtained as the convolution (b ∗ f) with a Gaussian filter kernel f. d. Fourier coefficients of the filtered signal. The amplification factor f̃(ω) (Fourier transform of the Gaussian kernel) is shown in the background (not to scale). The coefficients are obtained by multiplying the original coefficients (Figure b.) with the Gaussian. e. High-pass filtered version of the box function obtained by convolution with g := δ − f. f. Fourier coefficients of the high-pass filtered signal. The Fourier transform of the filter kernel (g̃ = 1 − f̃) is shown in gray (not to scale).

This is a new Fourier series with the unchanged DC component 1/2 and the new coefficients b*_k = b_k f̃(kν). This equation also illustrates why f̃(ω) is called a modulation transfer function: for each “modulation” (i.e. sinusoidal), it specifies a factor with which this sinusoidal is transferred through the linear translation-invariant

system. Figure 3.9d illustrates this result for the box function with T = 1 and a
Gaussian blur with σ = 0.2.
Figure 3.9 shows that by convolving the box function with a Gaussian, its high
frequency components are strongly attenuated while the low frequency components
are passed largely unchanged. This behavior is known as low-pass. It amounts to
a smoothing, or blur, of the input function. An ideal low-pass is a step function (Heaviside function) in the frequency domain, which however is hard to realize as a convolution in the space or time domain.
If the smoothed box function is subtracted from the original one, the high spatial
frequency components that were removed in smoothing, stand out in the difference
(Figure 3.9e). Since the box function itself can be thought of as its own convolution with the Dirac δ impulse (Equation 2.11), we can formally write this difference as a convolution with a new kernel g defined as g := δ − f. It will pass
the high frequencies and remove the low ones, and is therefore called a high-pass. In
the frequency domain, this particular high-pass is characterized by the coefficients
g̃(kν ) = 1 − f˜(kν ) (Figure 3.9f).
This example is an instance of the convolution theorem, the central result of this chapter. It states that the convolution of a signal with a point-spread function can be replaced in the Fourier domain by the multiplication of the Fourier transform of the signal and the modulation transfer function, and that this MTF is the Fourier transform of the point-spread function (see also Figure 3.11 below).
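For sampled, periodic signals, the convolution theorem can be verified directly with the discrete Fourier transform. In the following sketch (Python/NumPy; the signal and kernel are arbitrary random examples), circular convolution computed in the signal domain coincides with pointwise multiplication of the transforms:

import numpy as np

# Sketch of the convolution theorem on sampled, periodic signals:
# circular convolution equals pointwise multiplication of the DFTs.
rng = np.random.default_rng(0)
s = rng.standard_normal(256)             # arbitrary signal
h = rng.standard_normal(256)             # arbitrary kernel

# Circular convolution computed directly ...
conv = np.array([np.sum(s * np.roll(h[::-1], n + 1)) for n in range(256)])
# ... and via the Fourier domain
conv_fft = np.real(np.fft.ifft(np.fft.fft(s) * np.fft.fft(h)))

print(np.allclose(conv, conv_fft))       # True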

3.3.3 Finding the Coefficients


So far, we have seen that trigonometric polynomials can approximate continuous
functions. We write the general form of such polynomials as
$$p_n(x) := \sum_{k=0}^{n} a_k \cos k\nu x + \sum_{k=1}^{n} b_k \sin k\nu x. \qquad (3.46)$$

Note that the DC-component has been incorporated into the first sum, keeping in
mind that cos(0ν x) = 1 for all ν and x. In Equation 3.46 ν is again the fundamental
frequency of the signal, i.e. pn repeats itself with a wave-length of T = 2π /ν . Note
that this is true even if a1 = b1 = 0, i.e. if the fundamental frequency itself is missing
in the signal. An example of such a “missing fundamental” stimulus is obtained from
the box function by subtracting its first (fundamental) Fourier component.
How can we find the coefficients ak , bk ? If we assume that every continuous peri-
odic function can in fact be written as a trigonometric polynomial,7 we can find the
coefficients by exploiting the so-called orthogonality relations of sinusoids which
hold for all k, l > 0:

7 The proof of this assumption is the mathematically hard part of Fourier theory. We do not prove it here.



$$\int_0^{2\pi} \sin kx \, \sin lx \, dx = \begin{cases} \pi & \text{if } k = l \\ 0 & \text{if } k \neq l \end{cases} \qquad (3.47)$$
$$\int_0^{2\pi} \cos kx \, \cos lx \, dx = \begin{cases} \pi & \text{if } k = l \\ 0 & \text{if } k \neq l \end{cases} \qquad (3.48)$$
$$\int_0^{2\pi} \sin kx \, \cos lx \, dx = 0. \qquad (3.49)$$

Geometrical motivations for these relations can be found by plotting the involved functions for selected values of k and l⁸. With these relations, we obtain:

$$a_k = \frac{2}{T}\int_0^T p(x) \cos k\nu x \, dx; \quad k \in \{0, 1, 2, \ldots\} \qquad (3.50)$$
$$b_k = \frac{2}{T}\int_0^T p(x) \sin k\nu x \, dx; \quad k \in \{1, 2, 3, \ldots\}. \qquad (3.51)$$
These latter results are proven by substituting for p(x) from Equation 3.46 and applying the distributive law until the orthogonality terms appear for each element of the sum. All of these vanish except for the one where k = l, which evaluates to π.
We call ak the Fourier cosine coefficient for the frequency kν and bk the Fourier sine coefficient for the frequency kν.
In the complex notation, we proceed in the same way, using the orthogonality
relation

$$\int_0^{2\pi} \exp\{ikx\} \exp\{-ilx\}\,dx = \begin{cases} 2\pi & \text{if } k = l \\ 0 & \text{if } k \neq l \end{cases}. \qquad (3.52)$$
The coefficients are obtained from

$$c_k = \frac{1}{T}\int_0^T g(x) \exp\{-ik\nu x\}\,dx; \quad k \in \{\ldots, -2, -1, 0, 1, 2, \ldots\}. \qquad (3.53)$$

Note that this time, we have to consider coefficients with negative index number.
The resulting series reads
$$p_n(x) := \sum_{k=-n}^{n} c_k \exp\{ik\nu x\}. \qquad (3.54)$$

Clearly, the coefficients of the real and complex notations are related. The relations can be derived from Euler's formula. They read

$$c_j = \begin{cases} \frac{1}{2}(a_j - ib_j) & \text{for } j > 0 \\ \frac{1}{2}(a_{-j} + ib_{-j}) & \text{for } j < 0 \\ a_0 & \text{for } j = 0 \end{cases}. \qquad (3.55)$$
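A numerical sketch of Equation 3.53 (Python/NumPy, with the 1/T normalization as above; the box function serves as example) reproduces the coefficients predicted by Equations 3.40 and 3.55:

import numpy as np

# Sketch: complex Fourier coefficients of the box function via Eq. 3.53.
# From Eqs. 3.40 and 3.55: c_k = -i*b_k/2 with b_k = 2/(k*pi) for odd k.
T = 1.0
nu = 2 * np.pi / T
x = np.linspace(0, T, 10000, endpoint=False)
dx = x[1] - x[0]
b = (x < T / 2).astype(float)            # one period of the box function

for k in (0, 1, 2, 3):
    c_k = np.sum(b * np.exp(-1j * k * nu * x)) * dx / T
    print(k, np.round(c_k, 4))
# Expected: c_0 = 0.5, c_1 ~ -i/pi, c_2 ~ 0, c_3 ~ -i/(3*pi)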

8 Mathematically, it can be shown that eigenfunctions of a linear operator must be orthogonal as


long as the eigenvalues differ. This consideration is consistent with the elementary result.

3.4 Fourier Decomposition: Generalizations


We have now achieved the basic results of Fourier theory for the special case known
as Fourier series. Fourier series apply to periodic functions or functions with finite
domain which can be made periodic by simply repeating the finite domain over
and over again. An important example is given by functions defined on the circle.
Their Fourier representation is a discrete set of coefficients associated with so-called
harmonic frequencies, i.e. frequencies that are integer multiples of the fundamental.
Spectra of such functions, as introduced in Figure 3.1 are line spectra, with each line
corresponding to a discrete frequency component.
We now discuss two extensions of the Fourier series concept.

3.4.1 Non-periodic Functions


The generalization to non-periodic functions is mathematically difficult, but intu-
itively quite easy, if we consider functions of increasing period length T . For a given
T , for example T = 2π , we have coefficients at the multiples of the fundamental fre-
quency ν = 2π /T = 1,

$$\omega = k\nu = \frac{2k\pi}{T} \in \{1, 2, 3, 4, 5, \ldots\}. \qquad (3.56)$$

If the period is twice as long, T = 4π, we obtain ν = 1/2 and

$$\omega = k\nu = \frac{2k\pi}{T} \in \left\{\frac{1}{2}, 1, \frac{3}{2}, 2, \frac{5}{2}, \ldots\right\}. \qquad (3.57)$$
In the end, if the period is infinite (i.e. if the function is no longer periodic at all), we get a “coefficient” for every value of ω, i.e. a function of frequency. Switching back
to the complex notation, we thus obtain the Fourier transform:


$$\tilde{g}(\omega) := \int_{-\infty}^{\infty} g(x) \exp\{-i\omega x\}\,dx. \qquad (3.58)$$

By the same token, the trigonometric series becomes:




$$g(x) := \frac{1}{2\pi}\int_{-\infty}^{\infty} \tilde{g}(\omega) \exp\{i\omega x\}\,d\omega. \qquad (3.59)$$

Equation 3.58 is called the Fourier forward transform and Equation 3.59 the Fourier
backward transform. Applying both in a sequence reconstructs the original function
as long as it is continuous and square-integrable.
Note that the equations for Fourier forward and backward transform are almost
identical, up to a sign in the exponential and a normalizing factor, which can be split
symmetrically between the two transform equations. Applying the forward trans-
form twice results in a mirror image of the original function, applying it four times
reproduces the original.
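Both properties can be illustrated with the discrete Fourier transform (Python/NumPy sketch; the random test signal is an arbitrary choice). Note that NumPy's fft/ifft pair splits the normalizing factor differently than Equations 3.58/3.59, which is why the double forward transform is divided by the signal length below:

import numpy as np

# Sketch: forward + backward reconstructs; forward twice mirrors.
rng = np.random.default_rng(1)
g = rng.standard_normal(128)

print(np.allclose(g, np.fft.ifft(np.fft.fft(g))))          # reconstruction
g_mirror = np.real(np.fft.fft(np.fft.fft(g))) / len(g)     # double forward
print(np.allclose(g_mirror, g[(-np.arange(128)) % 128]))   # mirrored signal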

The Fourier transform of a periodic function does not produce the discrete coef-
ficients of the Fourier series, but rather a sum of δ -functions at the locations of the
multiples of the fundamental frequencies, weighted by the coefficients. In particular, the Fourier transform of a sine of frequency ω_o is π(δ(ω − ω_o) − δ(ω + ω_o))/i.
Thus, Fourier series and Fourier transform are two different operations. In practi-
cal applications, where numerical versions of the Fourier transform are used, this
difference is of minor relevance.
The function g̃(ω ) in Equation 3.58 is a complex function of the real variable ω
(cf. Figure 3.10). By Euler’s formula (Equation 3.19) the complex number g̃(ωo ) =
g̃c (ωo ) + ig̃s (ωo ) for each ωo gives the amplitude and phase of the component with
spatial frequency ωo . If only the spatial frequencies present in a pattern are to be
considered, one often uses the so-called power spectrum of g, i.e. the square of
the absolute value (modulus) of ω̃ . A famous theorem in Fourier theory (Wiener9-
Khinchin10 theorem) states that the power spectrum equals the Fourier transform of
the auto-correlation function introduced in Equation 2.57. In formal notation, we
may write
$$\tilde{\Phi}_{gg}(\omega) = |\tilde{g}(\omega)|^2 = \tilde{g}\tilde{g}^*. \qquad (3.60)$$

3.4.2 Fourier-Transforms in Two and More Dimensions


The Fourier transform generalizes to functions of two or more variables, such as
images or spatio-temporal intensity distributions. The sinusoidal must in this case
be replaced by a plane wave, e.g.

$$\sin(\omega_x x + \omega_y y) = \sin(\vec{\omega} \cdot \vec{x}) \qquad (3.61)$$


Fig. 3.10 Complex Fourier transform of a displaced Gaussian, exp{−(x − xo )2 }. a. 3D plot


showing the complex functional values of each frequency ω as “vectors”, or pointers in the
complex plane. b. Real and imaginary parts of the same function shown separately. The
lengths of the pointers in Figure a correspond to the power of the signal (Fourier transform
of autocorrelation). The angle of the pointer in the complex plane is the Fourier phase.

9 Norbert Wiener (1894 – 1964). United States mathematician.


10 Aleksandr Y. Khinchin (1894 – 1959). Soviet mathematician.

as are shown in Table 3.1. The argument of the sine function is an inner product (see
below, Section 4.1.3) of a frequency vector (ωx , ωy , ...) and a vector of the original
coordinates (x, y, ...). Time and temporal frequency may be treated as just another
component of these vectors. In two dimensions, these plane waves look like corru-
gated surfaces or wash-boards whose contour lines form a set of parallel straight
lines. The orientation of these contour lines is orthogonal to the vector (ω_x, ω_y); the separation of wave peaks (wavelength) is 2π/√(ω_x² + ω_y²) (cf. Figure 2.7a).
The Fourier transform then becomes a complex function of two or more real
frequency variables:



$$\tilde{g}(\omega_x, \omega_y) := \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x, y) \exp\{-i(\omega_x x + \omega_y y)\}\,dx\,dy. \qquad (3.62)$$

Each point in the frequency plane (ωx , ωy ) corresponds to one plane wave. An intu-
ition for this frequency plane may be obtained from Table 3.1: each cell in this table
represents a two-dimensional frequency vector for which the associated grating is
shown.
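The correspondence between plane waves and points in the frequency plane can be seen in a small sketch (Python/NumPy; the grid size and the frequency vector (3, 5) are arbitrary choices):

import numpy as np

# Sketch: a sampled plane wave concentrates its power at one point
# (and its mirror) in the 2D discrete Fourier plane.
N = 64
x, y = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
kx, ky = 3, 5                            # integer cycles per image
wave = np.sin(2 * np.pi * (kx * x + ky * y) / N)

power = np.abs(np.fft.fft2(wave))**2
peaks = np.argwhere(power > power.max() / 2)
print(peaks)                             # [[3 5], [61 59]]: (kx, ky) and mirror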

3.5 Summary: Facts on Fourier Transforms


1. Every (sufficiently) continuous function g can be unambiguously and reversibly represented by its Fourier transform g̃:

$$\text{forward:} \quad \tilde{g}(\omega) := \int g(x) \exp\{-i\omega x\}\,dx, \qquad (3.63)$$
$$\text{backward:} \quad g(x) := \frac{1}{2\pi}\int \tilde{g}(\omega) \exp\{i\omega x\}\,d\omega. \qquad (3.64)$$

g̃(ω) is a complex number which can be decomposed into a sine and a cosine component via Euler's formula. These components are called Fourier sine and Fourier cosine components, respectively. Intuitively, Equation 3.63 therefore means that every continuous function can be represented as the sum of sine and cosine functions.
2. In the set of functions, an infinite-dimensional coordinate system can be introduced by the basis “functions” δ(x − y) for each value of y. If the function is sampled and the list of values is treated as a vector, the canonical basis (i.e., basis vectors (0, ..., 0, 1, 0, ..., 0)) is an approximation of this basis. The Fourier transform can then be considered a coordinate transform with the new basis functions exp{iωx}. The orthogonality relation guarantees that this new basis is orthonormal. Since the length of a vector does not depend on the coordinate system used, the relation ∫ f²(x) dx = (1/2π) ∫ |f̃(ω)|² dω holds (Parseval's identity).
3. Convolution theorem: If the original function is replaced by its Fourier transform,
then convolution is replaced by multiplication:

$$(g * h)^{\sim}(\omega) = \tilde{g}(\omega)\,\tilde{h}(\omega) \qquad (3.65)$$



                  input                 system                              output

space/time        f(x)   -->   [ convolution with point image /    -->     h(x)
domain                           impulse response:  * g(x) ]

                    |          Fourier transform (both ways)          |

frequency         f̃(ω)   -->   [ multiplication with modulation     -->     h̃(ω)
domain                           transfer function:  × g̃(ω) ]

Fig. 3.11 Summary of the relation of Fourier transform and linear systems theory, i.e. the
convolution theorem.

(see Figure 3.11). The commutativity and associativity of convolution follow directly from this theorem.
4. Correlation or Wiener-Khinchin theorem: The Fourier transform of the cross-correlation function of two functions f, g,

$$\Phi_{fg}(y) := \int f(x)\,g(x+y)\,dx \qquad (3.66)$$

is given by the equation

$$\tilde{\Phi}_{fg}(\omega) = \tilde{f}(\omega)\,\tilde{g}^*(\omega) \qquad (3.67)$$

where z* := Re z − i Im z is the complex conjugate of z.
5. The squared modulus (squared complex absolute value) of the Fourier transform of a function f is known as the power spectrum of f. As a special case of the correlation theorem, we note that the power spectrum equals the Fourier transform of the auto-correlation function of f (Equation 3.60).
6. Shift theorem: Let g(x) be a function with Fourier transform g̃(ω) and s ∈ IR a number specifying a shift of g. The shifted version of g, g_s(x) := g(x − s), has the Fourier transform

$$\tilde{g}_s(\omega) = \exp\{-i\omega s\}\,\tilde{g}(\omega). \qquad (3.68)$$

Due to the symmetry of the Fourier transform, this implies that the Fourier transform of a Gabor function is a displaced (shifted) Gaussian.
7. Scale theorem: Let g(x) be a function with Fourier transform g̃(ω) and a ∈ IR a scaling factor. The scaled version of the function, g_a(x) := g(ax), has the Fourier transform

$$\tilde{g}_a(\omega) = \frac{1}{a}\,\tilde{g}\!\left(\frac{\omega}{a}\right). \qquad (3.69)$$

The uncertainty relation studied in Equation 3.31 is a special case of the scale theorem.

3.6 Suggested Reading


Books
Bracewell, R. N. (2003). Fourier Analysis and Imaging. Kluwer/Plenum, New York.
Butz, T. (2005). Fourier Transformation for Pedestrians. Springer, Berlin.
James, J. F. (2011). A Student's Guide to Fourier Transforms. With Applications in Physics and Engineering. 3rd edition, Cambridge University Press, Cambridge.
Tolstov, G. P. (1962). Fourier Series. Prentice-Hall, Englewood Cliffs, NJ.

Original Papers
Daugman, J. (1980) Two-dimensional spectral analysis of cortical receptive field
profiles. Vision Research 20:847 – 856
Extends spatial frequency models of the receptive field to the two-dimensional
case and gives clean definitions of orientation and frequency specificity.
Henriksson, L., Nurminen, L., Hyvärinen, A., and Vanni, S. (2008) Spatial frequency tuning in human retinotopic visual areas. Journal of Vision 8(10):5, 1–13
Spatial frequency tuning in human visual cortex is studied based on fMRI
data. The paper shows differences between different areas and gives a useful
overview of earlier findings on spatial frequency tuning.
Jones, J. P. and Palmer, L. A. (1987) An evaluation of the two-dimensional Gabor
filter model of simple receptive fields in cat striate cortex. Journal of Neuro-
physiology 58:1233 – 1258
Describes detailed fits of cortical receptive field profiles by Gabor functions with
suitable parameters. Additionally, spatial frequency specificities are measured
with grating stimuli (see next chapter) and the results are compared to predic-
tions derived from the Gabor fits.
Chapter 4
Artificial Neural Networks

Abstract. In models of large networks of neurons, the behavior of individual neurons is treated much more simply than in the Hodgkin-Huxley theory presented in Chapter 1: activity is usually represented by a binary variable (1 = firing; 0 = silent) and time is modeled by a discrete sequence of time steps running in synchrony for all neurons in the net. Besides activity, the most interesting state variable of such networks is synaptic strength, or weight, which determines the influence of one neuron on its neighbors in the network. Synaptic weights may change according to so-called “learning rules” that allow network connectivities to be optimized for the performance of various tasks. The networks are thus characterized by two state variables: a vector of neuron activities per time step and a matrix of neuron-to-neuron transmission weights describing the connectivity, which also depends on time. In this chapter, we will discuss the basic approach and a number of important network architectures for tasks such as pattern recognition, learning of input-output associations, or the self-organization of representations that are optimal in a certain, well-defined sense. The mathematical treatment is largely based on linear algebra (vectors and matrices) and, as in the other chapters, will be explained “on the fly”.

4.1 Elements of Neural Networks


Artificial neural networks are built of three major building blocks which will be
discussed in detail in the sequel. These building blocks are:
1. The activity (excitation) of a neuron together with its activation dynamics. As
already discussed in the theory of receptive fields, activity is generally modeled
not as a membrane potential, but as a number reflecting instantaneous spike rate.
In neural network models, it is normally discretized into separate time steps.
2. The synaptic weights and learning rules governing the flow of activation in the network. This flow is also called the activation dynamics. Change of synaptic weights (“weight dynamics”) as a result of previous activation patterns is studied as a model of learning.


3. The topology of the network is the pattern of connectivity between the neurons
involved. It is generally described by a weight matrix, but higher level descrip-
tions such as feed-forward vs. feed-back or layered vs. completely connected
are also used. We will assume the overall topology fixed, but dynamic changes
such as the creation of new neurons are also studied in the literature (“growing
networks”).

4.1.1 Activity and the States of a Neural Network


We start by numbering the neurons in a net with positive integers i ∈ IN. For
each neuron i, the number ei is the current state of excitation or activity. In “log-
ical” neurons, e can take only the values 0 or 1. In other models, the activity
may be considered a continuous variable, either in the interval [0, 1] or in the real
numbers IR.
If we consider a network with I neurons, we can write the activity states of all
neurons as an ordered list, or vector
$$\vec{e} := \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_I \end{pmatrix} = (e_1, e_2, \ldots, e_I)^\top. \qquad (4.1)$$

If e_i ∈ IR, the set of all possible activity vectors forms a vector space with dimension I, IR^I. Note that I may be much larger than three, rendering geometric interpretations of this vector space difficult. In the sequel, we will occasionally give geometric interpretations for I = 2 or I = 3. For the general case, e⃗ should be thought of as an ordered list without trying to imagine I-dimensional spaces. Note also that, following standard conventions, vectors are always considered to be columns of numbers. If we need to consider rows, we use the “transposition” symbol (e⃗^⊤), which works in both directions, i.e., (e⃗^⊤)^⊤ = e⃗.
The vector e⃗ is also called a state vector of the neural network since it contains the information about the current activity state. In time-discrete models, an upper index t is attached to the state vector, which is then written as e⃗^t.
If the neurons have spatial coordinates, as in the cases studied in previous chapters, e⃗ becomes a function e of space and time and each “neuron” i is identified with a point (x_i, y_i), leading to the interpretation

$$\vec{e}^{\,t} = (e(x_1, y_1, t), e(x_2, y_2, t), \ldots, e(x_I, y_I, t))^\top. \qquad (4.2)$$

Equation 4.2 relates discrete neural network theory to layers of continuous activity
functions as were studied in convolution systems, Equation 2.16.

4.1.2 Activation Function and Synaptic Weights


To model the dynamic development of neural activity in the network, each neuron
is given an activation function (also called transfer function)

$$\alpha_i : \vec{e}^{\,t} \mapsto e_i^{t+1}, \qquad (4.3)$$

which takes the complete activity vector of the network at time t as its input and
the activity of neuron i at time t + 1 as its output. The activation function is usually
described by synaptic weights wi j and a static non-linearity f :
$$\alpha_i(\vec{e}) = f\left( \sum_{j=1}^{J} w_{ij}\,e_j \right). \qquad (4.4)$$

Examples of static non-linearities which are also used in neural networks have been given in Section 2.3.2. The weighted sum forming the argument of the static non-linearity, Σ_j w_{ij} e_j, is called the “potential” of cell i and denoted by u_i.
Here, we use the convention that the weights wi j are indexed with the number
of the post-synaptic cell (i) and the pre-synaptic cell ( j). In more detail, one might
read “weight of input received by i from j.” If a unit k does not receive input from
unit l, the weight wkl is set to zero. Thus, the set of all weights also determines the
connectivity pattern or topology of the network.
Note that except for the non-linearity f, Equation 4.4 is analogous to the correlation Equation 2.9 if we discretize the input plane (x_j, y_j) and consider the image intensities I(x_j, y_j) as the presynaptic inputs e_j (e.g., activities of receptor cells). The corresponding values of the receptive field function φ(x_j, y_j) become the weights w_{oj} of the sole output unit e_o. The differences between the two equations are (i) that in Equation 2.9 we have identified neurons with their spatial coordinates, which implies a dense, continuous layout and hence an integration operation over space (cf. Equation 4.2), and (ii) that the receptive field function φ in Equation 2.9


Fig. 4.1 a. A simple neural network consisting of four units (“neurons”) with activities e_1, ..., e_4 and links with transmission weights w_{ij}. For explanations see text. b. Simple model neuron for artificial neural networks. e_j: input activities, w_{ij}: synaptic weights, Σ: summation, u_i: potential (u_i = Σ_{j=1}^J w_{ij} e_j), f: static non-linearity, e_i: output activity (e_i = f(u_i)).

is not modeling just one synaptic layer, but describes the entire preprocessing from
the retina to the neuron in question.

4.1.3 The Dot Product


For a given postsynaptic neuron, the index i in Equation 4.4 is fixed and the set of weights can be considered a vector of the same dimension as the input vector. We denote the vector of input weights of a given neuron i as w⃗_i, where i is the first (“postsynaptic”) index, and note that w⃗_i := (w_{i1}, w_{i2}, ..., w_{iJ})^⊤. The potential u_i is then obtained as the “dot product” of the vectors w⃗_i and e⃗,

$$u_i = (\vec{w}_i \cdot \vec{e}) := \sum_{j=1}^{J} w_{ij}\,e_j. \qquad (4.5)$$

The dot product is also known as the inner product, in which case it is often written as w⃗_i^⊤ e⃗, or as scalar product, to indicate that the result is a scalar (i.e. a number as opposed to a vector). It has various geometric interpretations, which are helpful for understanding neural network theory.

Norm of a Vector
First, the length, or “norm”, of a vector can be defined as

$$\|\vec{x}\| := \sqrt{\sum_i x_i^2} = \sqrt{(\vec{x} \cdot \vec{x})}, \qquad (4.6)$$

which in two dimensions is simply Pythagoras’ theorem. Clearly, the length of a


vector is well defined in any number of dimensions. A vector with norm 1 is called
a unit vector. For any vector (except (0, 0, ...)), a unit vector with the same direc-
tion can be generated by dividing the vector by its norm. This process is called a
normalization.

Orthogonality and Angles


Second, two vectors with vanishing dot product, (x·y) = 0, are said to be orthogonal,
or uncorrelated. Orthogonality is a geometrical concept which, like length, easily
generalizes to multiple dimensions. Consider two non-collinear vectors in three-
dimensional space; they span a plane, i.e. a two-dimensional subspace, in which the
intuitive notion of an angle applies. The same is true also for higher dimensional
spaces. The dot product can therefore be used to define angles; in this sense, the dot
product of two unit vectors equals the cosine of the angle included by these vectors:

$$(\vec{x} \cdot \vec{y}) = \|\vec{x}\|\,\|\vec{y}\|\,\cos\angle(\vec{x}, \vec{y}). \qquad (4.7)$$


Fig. 4.2 Dot product as projection. For two vectors x⃗, y⃗, the projection of y⃗ onto x⃗ is the vector p⃗. Its length is given via the dot product as ||p⃗|| = (x⃗ · y⃗)/||x⃗|| = (x⃗_o · y⃗), where x⃗_o = x⃗/||x⃗|| is the unit vector in direction of x⃗. For the vector p⃗ itself, we thus obtain p⃗ = (x⃗_o · y⃗) x⃗_o. To prove this relation, verify that the projection line y⃗ − p⃗ is orthogonal to x⃗.

Thus, if the dot product of two vectors is zero, the angle included by the vectors
is ±90◦. The relation of the dot product to correlation will be explained later (see
Equation 4.16).

Projection
Finally, the dot product is related to the notion of a projection. If an arbitrary vector
(x1 , x2 , ...xn ) is multiplied with a coordinate vector (0, . . . , 0, 1, 0, . . . 0), where the
only 1 appears at position j, the result is x j , i.e., the component of the vector in
direction of the j-th coordinate vector. In general, the dot product of two vectors
can be considered as the length of the component of one vector in the direction of
the other (see Fig. 4.2). Clearly, for orthogonal vectors, this length is 0.
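The following sketch (Python/NumPy; the two vectors are arbitrary examples) illustrates the norm, the angle, and the projection property of the dot product:

import numpy as np

# Sketch of the geometric readings of the dot product (Eqs. 4.5-4.7, Fig. 4.2)
x = np.array([3.0, 4.0])                 # arbitrary example vectors
y = np.array([2.0, 1.0])

norm_x = np.sqrt(x @ x)                  # ||x|| = 5 (Pythagoras)
x_o = x / norm_x                         # normalization to a unit vector
cos_angle = (x @ y) / (norm_x * np.sqrt(y @ y))   # Eq. 4.7
p = (x_o @ y) * x_o                      # projection of y onto x (Fig. 4.2)

print(norm_x, cos_angle, p)
print(np.isclose((y - p) @ x, 0.0))      # projection line orthogonal to x: True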

4.1.4 Matrix Operations


If we neglect the static non-linearity in Equation 4.4 (i.e. if we assume f (u) = u for
all u), we obtain for the example given in Figure 4.1a the activation update rule:

$$\begin{aligned}
e_1^{t+1} &= 0\,e_1^t + 0\,e_2^t + 0\,e_3^t + w_{14}\,e_4^t \\
e_2^{t+1} &= w_{21}\,e_1^t + 0\,e_2^t + w_{23}\,e_3^t + w_{24}\,e_4^t \\
e_3^{t+1} &= 0\,e_1^t + w_{32}\,e_2^t + 0\,e_3^t + w_{34}\,e_4^t \\
e_4^{t+1} &= 0\,e_1^t + w_{42}\,e_2^t + 0\,e_3^t + 0\,e_4^t
\end{aligned} \qquad (4.8)$$

In matrix notation, the same equation reads:


$$\begin{pmatrix} e_1^{t+1} \\ e_2^{t+1} \\ \vdots \\ e_I^{t+1} \end{pmatrix} = \begin{pmatrix} w_{11} & w_{12} & \ldots & w_{1I} \\ w_{21} & w_{22} & \ldots & w_{2I} \\ \vdots & \vdots & & \vdots \\ w_{I1} & w_{I2} & \ldots & w_{II} \end{pmatrix} \begin{pmatrix} e_1^t \\ e_2^t \\ \vdots \\ e_I^t \end{pmatrix}. \qquad (4.9)$$

Note that each component e_j^{t+1} of the output vector is the dot product of the j-th row of the matrix and the input vector e⃗^t. Weights such as w_{12} which are missing in the figure are simply set to zero. The matrix of all weights is a square matrix and will be denoted by W. We may then write Equation 4.9 more briefly as:

$$\vec{e}^{\,t+1} = W\,\vec{e}^{\,t}. \qquad (4.10)$$
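As a sketch (Python/NumPy), the update rule of Equation 4.10, combined with a static non-linearity as in Equation 4.4, can be iterated for the topology of Figure 4.1a; the particular weight values and the sigmoid non-linearity are arbitrary choices:

import numpy as np

# Sketch of the synchronous update e^{t+1} = f(W e^t) for the four-unit
# network of Fig. 4.1a (weight values chosen arbitrarily for illustration).
def f(u):
    return 1.0 / (1.0 + np.exp(-u))      # example static non-linearity

W = np.array([[0.0, 0.0, 0.0, 0.5],      # w14
              [0.8, 0.0, 0.3, 0.2],      # w21, w23, w24
              [0.0, 0.6, 0.0, 0.4],      # w32, w34
              [0.0, 0.7, 0.0, 0.0]])     # w42
e = np.array([1.0, 0.0, 0.0, 0.0])       # initial state vector

for t in range(3):
    e = f(W @ e)                         # one synchronous time step
    print(t + 1, np.round(e, 3))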

4.1.5 Weight Dynamics (“Learning Rules”)


At the core of neural network theory is the introduction of synaptic learning rules,
i.e. rules for the change of synaptic weights as a function of the network’s state. We
briefly summarize here the most important learning rules. Further explanations will
be given in the course of the presentation. The detailed relation between the physio-
logical mechanisms of synaptic plasticity and the resulting learning mechanisms is
a topic of active research.
1. Unsupervised learning is usually modeled by variants of the Hebb¹ rule. This rule states that the synaptic weight is increased if the pre- and post-synaptic cells are active in subsequent time steps, i.e. if the pre-synaptic cell fires and the synapse is “successful” in the sense that the post-synaptic cell fires as well (see the sketch after this list). The Hebb rule models classical conditioning (see Fig. 4.3). A possible physiological mechanism of Hebbian learning is long term potentiation.
2. In competitive learning, the weight change is determined not only by the ac-
tivities of the pre- and post-synaptic neuron, but also by the activities of other,
“competing” neurons. Usually, only the most active neuron in a group (the “win-
ner”) will be allowed to learn. Competitive learning is used in models of self-
organization such as Kohonen’s self-organizing feature map. Physiologically,
competition may be about resources needed for synaptic growth.
3. In reinforcement learning, the weight change is determined by the activities of
the pre- and post-synaptic neuron and by a “pay-off” signal carrying information
about the overall performance of the network. The pay-off signal is one global
variable transmitted to all units in the network, telling them whether the last
action was good or bad. It does not say what should have been done instead. As
possible physiological mechanisms, neuro-modulation has been discussed.
4. In supervised learning, the weight change is determined by a specific teacher
signal telling each neuron how it should have reacted to its last input. The most
popular scheme for supervised learning is back-propagation. The physiological
basis of a teacher signal is hard to imagine.


Fig. 4.3 Classical conditioning and the Hebb synapse. CS: conditioned stimulus, US: unconditioned stimulus, R: response, w0, w1: synaptic weights. Before learning, a response R is elicited only by the unconditioned stimulus US. If CS and US occur together, w1 will be reinforced since CS and R are both active. Eventually, CS can drive unit R on its own.

1 Donald O. Hebb (1904 – 1985). Canadian psychologist.


Fig. 4.4 Simple perceptrons with only two input lines. a. Feature space. b. Perceptron de-
tecting patterns with the characteristic “brightness increasing to the right”. c. Perceptron de-
tecting patterns whose average intensity is above 0.5.

4.2 Classification
4.2.1 The Perceptron
As already pointed out in Section 2.1.1, the function of neurons is often seen in recognizing or detecting things, or “patterns”. Since all inputs to a neuron arrive through its synapses, such patterns are ultimately vectors of input activities. The idea is that neurons implement logical predicates (i.e. statements that are either true or false) about this stimulus vector. The value “TRUE” is signaled by the neuron taking the activity 1, the value “FALSE” by the activity 0. A simple example is shown in
Fig. 4.4. Consider a neuron with just two inputs, let’s say from light sensitive recep-
tors. How can we build a neuron that responds if the right receptor receives more
light than the left receptor? Simply enough, we can use an excitatory connection
from the right receptor and an inhibitory connection from the left (Fig. 4.4b). If we
apply a step non-linearity to the weighted input sum, the cell will be active if and
only if the right sensor receives more light.
For the understanding of the perceptron, and indeed for pattern recognition in
general, the notion of a feature space is crucial. In our example, the feature space
consists of all possible pairs of light intensities (s1 , s2 ) that might be delivered to the
two sensors. This is equivalent to the positive quadrant of the real two-dimensional
plane (Fig. 4.4a). Each possible pattern corresponds to a point in feature space and
the perceptron will react with activity 1 to patterns taken from a certain subset of the
feature space; other patterns will elicit no response. In the example of Figure 4.4b,
all light intensity pairs where the right pixel is brighter fall in a triangular area
above and left of the diagonal s1 = s2 . Figure 4.4c shows a second example of
a perceptron with two inputs calculating the predicate “average brightness above

0.5.” All patterns satisfying this condition lie to the right of and above the line s1 + s2 = 1, also shown in Figure 4.4a.
In general, feature spaces have a separate dimension for each input line of the
perceptron. If one considers the retina as the input layer, the number of dimensions
becomes the number of ganglion cells which in humans is in the order of 106 .
In terms of our neural network models, we can model a perceptron as a single
unit with inputs s1 , ..., sN , forming an input vector s, and with weights w1 , ..., wN ,
forming a weight vector w. The weighted sum of the inputs is passed through a step
non-linearity with threshold θ
$$u = \sum_{j=1}^{J} w_j s_j = (\vec{w} \cdot \vec{s}) \qquad (4.11)$$
$$e = f(u) := \begin{cases} 0 & \text{if } u \leq \theta \\ 1 & \text{if } u > \theta \end{cases}. \qquad (4.12)$$

The perceptrons in Fig. 4.4b, c have w⃗ = (−1, 1)^⊤ with θ = 0, and w⃗ = (0.5, 0.5)^⊤ with θ = 0.5, respectively. Note that the weight vector w⃗ has the same dimension as the stimulus vector and can therefore be interpreted as a vector in feature space.
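Both perceptrons can be written down directly (Python/NumPy sketch; the test stimuli are arbitrary examples):

import numpy as np

# Sketch of the two perceptrons of Fig. 4.4 (Eqs. 4.11-4.12)
def perceptron(w, theta, s):
    return 1 if w @ s > theta else 0     # step non-linearity, Eq. 4.12

right_brighter = lambda s: perceptron(np.array([-1.0, 1.0]), 0.0, s)
bright_average = lambda s: perceptron(np.array([0.5, 0.5]), 0.5, s)

print(right_brighter(np.array([0.2, 0.9])))   # 1: right receptor brighter
print(right_brighter(np.array([0.9, 0.2])))   # 0
print(bright_average(np.array([0.8, 0.7])))   # 1: mean brightness 0.75 > 0.5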

4.2.2 Linear Classification


Decision Boundary
We now ask the question of which subsets of a feature space can be recognized by
a perceptron. This question is usually answered in terms of a “decision boundary”,
i.e. a line in feature space separating stimuli that the perceptron does or does not
respond to. Crossing this boundary means that the output e switches between 0 and 1; the boundary is therefore determined by setting u = θ in Equation 4.12. In the resulting equation, (w⃗ · s⃗) = θ, the weight vector w⃗ is a constant and the set of all points s⃗ satisfying the equation is a straight line orthogonal to w⃗, passing the origin at distance θ/||w⃗|| (Fig. 4.5). The decision boundary of a perceptron with two inputs is therefore a straight line in feature space and cannot be curved. The perceptron is
thus an example of a “linear classifier” where the term linear refers to the nature of
the decision boundary, as well as to the summation part of the activation function
(Equation 4.11).
Another way to think of linear classification relates to the projection property of
the dot product in Equation 4.12 (cf. Fig. 4.2), which implies that any input vector s
will be projected orthogonally onto the line defined by the weight vector. All points
on lines orthogonal to the weight vector will therefore end up at the same location
and will be put in the same category in the classification step.
In perceptrons with three input lines (J = 3), the feature space will have three dimensions. The weight vector still defines a straight line in that space. At a distance θ/||w⃗|| along that line, the potential will take the value θ. The decision boundary then becomes a plane orthogonal to the weight vector and passing through the

Fig. 4.5 The perceptron classifies feature vectors to one side of the linear hyperplane (dotted line) (w⃗ · s⃗) = θ in feature space. The distance at which this hyperplane passes the origin is θ/||w⃗||. An input vector s⃗ is projected orthogonally onto the weight vector direction, yielding the potential u.

u = θ point on the weight vector line. Mathematically, this plane is still defined by the equation (w⃗ · s⃗) = θ, which is also known as the normal form of a plane equation.
Indeed, this logic generalizes to an arbitrary number of dimensions. The decision
boundary in a J-dimensional feature space will always be a linear subspace with
dimension J − 1, cutting the feature space into two parts. Such subspaces are called
“hyperplanes”.

Optimal Stimulus
On inspection of Fig. 4.4 b, it can be seen that the perceptron detecting a brightness
increase to the right has a weight vector w = (−1, 1) which, when interpreted
as an input image, also corresponds to a rightwards increase in brightness. This
observation reflects a general principle, i.e. that perceptrons are “matched filters”
detecting patterns similar to their weight vector. In the same logic, we discussed
center-surround receptive fields with an inhibitory center and excitatory surround as
“bug detector” in Section 2.1.1 above.
Mathematically, we can ask which pattern will lead to the strongest potential in a perceptron. Since the calculation of the potential is a linear operation (doubling the input intensity doubles the potential), this question is only meaningful if we normalize the input pattern, i.e. consider only stimuli s⃗ satisfying ||s⃗|| = 1. In this case, we call

$$\vec{s}^{\,*} := \mathop{\mathrm{argmax}}_{\vec{s}}\,(\vec{w} \cdot \vec{s}) \qquad (4.13)$$

the “optimal stimulus” of the perceptron. From the interpretation of the dot product as a projection, it is clear that the maximum is reached if s⃗ and w⃗ point in the same direction,

$$\vec{s}^{\,*} = \frac{\vec{w}}{\|\vec{w}\|}. \qquad (4.14)$$
Formally, this result is proven by the so-called Cauchy-Schwarz inequality, (a⃗ · b⃗) ≤ ||a⃗|| ||b⃗||. The situation is completely analogous to the discussion of optimal stimuli and matched filters in Section 2.1.1. In fact, if we consider continuous layers of
neurons (Equation 4.2), the resulting feature space will have an infinite number

of dimensions, one for every point of the continuous plane. In this situation, the
dot-product is no longer a discrete sum, but an integral, and in fact the correlation
integral defined in Equation 2.9.
The idea of optimal stimuli can also be related to the statistical notion of covari-
ance. Formally, we can calculate the covariance of s⃗ and w⃗ by considering the value pairs (s_1, w_1), (s_2, w_2), ..., (s_J, w_J). If we denote the means as

$$\bar{s} := \frac{1}{J}\sum_{j=1}^{J} s_j \quad \text{and} \quad \bar{w} := \frac{1}{J}\sum_{j=1}^{J} w_j, \qquad (4.15)$$

we obtain

$$\operatorname{cov}(\vec{s}, \vec{w}) = \frac{1}{J}\sum_{j=1}^{J} (s_j - \bar{s})(w_j - \bar{w}) = \frac{1}{J}(\vec{s} \cdot \vec{w}) - \bar{s}\bar{w}. \qquad (4.16)$$

That is to say, the dot product equals the covariance up to a constant scaling factor
1/J and an additive constant which vanishes if the means are zero. Like the dot
product, covariance is maximal if s and w are aligned.

4.2.3 Limitations
If the perceptron is a good model of neural computation, it should be able to do
more interesting things than the examples given in Fig. 4.4. This is indeed the case
if the dimension of feature space is large. However, in the development of neural
network theory, two limitations of the perceptron have received much attention, the
so-called “XOR-problem” and the locality of the perceptron.
Suppose we want to construct a two-pixel perceptron that responds with e = 1 if one and only one of its inputs is active. In logic, the related predicate is called “exclusive or”, or XOR. If we plot this in two-dimensional feature space, the four stimulus vectors, s⃗ = (0, 0), (0, 1), (1, 0), and (1, 1), form a square. Clearly, there can
be no line (hyperplane) such that the points (0, 1) and (1, 0) fall on one side of the
plane and (0, 0) and (1, 1) fall on the other side. If classification boundaries are linear
hyperplanes, this problem can only be solved by using a cascade of perceptrons
(multi-layer perceptron, MLP), where the outputs of three initial perceptrons form
the input of a fourth, higher level perceptron (Fig. 4.6). The output units of the initial
perceptrons are called “hidden” because their activities do not explicitly show up in
the classification procedure, i.e. are neither input nor output. Here, we denote the
activity of the hidden units by the letter h. The response of the resulting three-layer
perceptron, using the weights shown in Figure 4.6, is illustrated by the table in
Fig. 4.6b.
If we allow for multiple layers, perceptrons can thus do more than just linear
classification. By means of the hidden units, the classification problem is embedded
in a higher dimensional space. Figure 4.6c shows the position of the resulting h
vectors for the XOR-problem as corners of a unit cube in three-dimensional space.
In this embedding, it is possible to separate the “true” and “false”-cases by a linear
decision boundary, i.e. the plane −h1 + 2h12 − h2 = 0 also shown in the figure. The

[Figure 4.6a: network diagram; Figure 4.6c: 3D feature space of the hidden units; see caption.]

b. Truth table:

     In        Hidden        Out
   s1  s2    h1  h2  h12      e
    0   0     0   0   0       0
    1   0     1   0   1       1
    0   1     0   1   1       1
    1   1     1   1   1       0

Fig. 4.6 Three-layer perceptron implementing an exclusive or. a. Network topology with two input lines and three hidden units. The thresholds are set to θ = 0. b. Truth table. c. 3D feature space for the final unit, taking the hidden units as input lines. Possible input vectors are marked by black dots on the cube. In the 3D feature space, linear separation of “true” and “false” cases with a plane is possible. w⃗ denotes the weight vector, w⃗ = (−1, −1, 2)^⊤, which is normal to the decision plane. The intersection between the decision plane and the unit cube in h-space is indicated by a bold line.

“true” cases (0, 1, 0) and (1, 0, 0) fall above this plane and yield the output 1; the “false” cases (0, 0, 0) and (1, 1, 1) lie in the decision plane and yield the output 0. When projected back into the s1, s2-plane, the decision boundary is no longer linear (i.e., a line), but a polygon with the desired separation properties.
A second limitation becomes apparent if a generalization of the XOR-problem
is considered which is known as the parity problem: is it possible to build a per-
ceptron that responds if and only if the number of active inputs is odd? For just two
inputs, this is the XOR-problem. Again, such perceptrons are possible if three layers
are used. One can show, however, that each such perceptron will need at least one
hidden unit looking at all inputs simultaneously. Similarly, if the “connectedness”
of complex geometrical line patterns such as intertwined spirals is considered, one
hidden unit is required receiving input from the entire input layer. This is in conflict
with the idea of breaking down vision into local bits and pieces, each processed by
a unit with a small receptive field. The famous book by Minsky & Papert (1988)
discusses this problem in detail. It has led many researchers to believe that neural
networks cannot explain perception. More recently, however, it has been recognized
that the parity and connectedness problems, interesting as they are from a compu-
tational point of view, may not be the central problems in visual perception and
visually guided behavior. In fact, judging the connectedness of intertwined spirals,
say, is not an easy task for human observers either.
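The three-layer solution of the XOR problem can be checked in a few lines (Python sketch; the hidden units are written out according to the truth table of Figure 4.6b, with the output weights from the text):

# Sketch of the three-layer XOR perceptron of Fig. 4.6: hidden units
# h1 = s1, h2 = s2, h12 = OR(s1, s2), as in the truth table; output
# weights (-1, -1, 2) for (h1, h2, h12); all thresholds theta = 0.
step = lambda u: int(u > 0)

for s1, s2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    h1, h2 = step(s1), step(s2)
    h12 = step(s1 + s2)
    e = step(-h1 - h2 + 2 * h12)         # linear separation in h-space
    print(s1, s2, "->", e)               # reproduces the XOR truth table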

4.2.4 Supervised Learning and Error Minimization


Until now, we have assumed that the weights are given, or set by the modeler, which of course raises the question of how this can be achieved. Mathematically, there exists an elegant solution which, for multi-layer perceptrons, is known as back-propagation, an example of supervised learning. It can be used to check whether an MLP for a given problem exists, but is hard to interpret as a physiological mechanism.
Consider a perceptron whose task is to classify a pattern presented on its input
layer. For example, the perceptron should respond with activity 1 if the pattern is a
capital letter “A”, and it should stay silent in all other cases. The desired performance
of the perceptron may then be described by a function T mapping the input set
(feature space) into the set of activity values of the perceptron. For example, T (s) =
1 if s is the image of the capital letter “A”; and T (s) = 0 if it is not. T (s) is called
the teacher signal. How can we determine the optimal weights in this case?

Error Function for Two-Layer Perceptrons


The original learning rule for perceptrons was derived from a simple heuristic. Assume a unit's output is higher than the teacher signal. In this case, the input weights which have been carrying an activation should be reduced. Likewise, if the output is below the teacher signal, the weights of active input lines have to be increased. Weights of non-active input lines are left unchanged in both cases. Formally, this rule can be written as

$$\Delta w_j = \lambda\,(T - e)\,s_j, \qquad (4.17)$$
which evaluates to a positive change if the output was too low and a negative change
if the output was too high. λ ∈ IR+ is a “learning rate”; it can be used to enforce
convergence of the learning process by gradually decreasing it to zero.
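A minimal sketch of this learning rule (Python/NumPy) is given below; the task, the learning rate, and the use of a constant bias input to realize the threshold are illustrative assumptions, with the predicate of Figure 4.4c serving as teacher:

import numpy as np

# Sketch of the perceptron learning rule (Eq. 4.17), trained on the
# linearly separable predicate "average brightness above 0.5" (Fig. 4.4c).
rng = np.random.default_rng(2)
w = np.zeros(3)                          # third component: bias input (assumed)
lam = 0.1                                # learning rate lambda (assumed value)

for _ in range(2000):
    s = np.append(rng.random(2), 1.0)    # stimulus plus constant bias input
    T = 1 if (s[0] + s[1]) / 2 > 0.5 else 0   # teacher signal
    e = 1 if w @ s > 0 else 0
    w += lam * (T - e) * s               # Eq. 4.17
print(np.round(w, 2))                    # approximates the plane s1 + s2 = 1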
This simple idea assumes that it is known which weights are “responsible” for a
given output. However, this is often not the case in more complex network topolo-
gies. We therefore consider a more systematic approach. To this end, we define the

Fig. 4.7 Set of four two-layer perceptrons e_1, ..., e_4 with common input vector s_1, ..., s_3 and weights w_{ij} (e_i = f(u_i) with u_i = Σ_j w_{ij} s_j). The topology is a complete feed-forward connectivity between the input and output layer, just as in the hetero-associator shown in Fig. 4.11.

performance error of the network by comparing actual and desired outputs. This is
formulated in terms of an error function depending on the network weights. Learn-
ing then becomes a minimization problem, i.e. it amounts to finding a set of weights
such that the error is minimized.
We start by rewriting the activity of a simple perceptron as in Equation 4.12:
$$e(\vec{s}) = f\left(\sum_{j=1}^{J} s_j w_j\right) = f((\vec{s} \cdot \vec{w})) \qquad (4.18)$$

where f is a suitable non-linearity including a possible threshold. In this section,


we will assume that f is differentiable. Instead of the step non-linearity, we might
therefore use a sigmoidal function.
Since we know what the answer of the perceptron to a stimulus s should be—
it is defined by the teacher signal T (s)—we can now define a performance of the
perceptron via the squared error E :
$$E(\mathbf{w},\mathbf{s}) := [e(\mathbf{s}) - T(\mathbf{s})]^2 = \Big[f\Big(\sum_{j=1}^{J} s_j w_j\Big) - T(\mathbf{s})\Big]^2. \qquad (4.19)$$

Clearly, this is the error for an individual stimulus presentation. Eventually, we


would want to calculate the average error over all stimuli and minimize this by
choosing the best possible weight vector. The average error for a given training set
of stimuli is given as the average of E(w, s^t) over all stimuli s^t in the training set.
We can therefore drop the arguments and obtain an error function depending solely
on w. The optimal weight set is the one minimizing the error function E(w).

Gradient Descent
Minimization of functions of several variables is a general and important topic of
scientific computing. If the error function (also known as cost function or objective
function) is differentiable, the following logic is generally used:
The error function represents a “surface” over the space spanned by its variables,
in our case w1 , ..., wJ (see Fig. 4.8a). Note that this space can be identified with the
feature space. Every error value is an “elevation” in that landscape and the optimum
(minimum) corresponds to the deepest “depression” or trough in that landscape.
For finding this minimum we need to consider the partial derivatives of E with
respect to each coordinate w j . Fig. 4.8a shows a point (a, b) in the (w1 , w2 )-plane
and its surface elevation E(w1 , w2 ). We now consider a vertical section through the
landscape passing through (a, b) and running in parallel to the coordinate axis w1 .
Within this section, the error function reduces to E(w1 , b), i.e. a function of just one
variable w1 since b is constant. The derivative of this function at w1 = a is called the
partial derivative of E at (a, b) in direction w1 :

$$E_{w_1}(a,b) = \frac{\partial E}{\partial w_1}(a,b) := \lim_{h \to 0} \frac{E(a+h,b) - E(a,b)}{h}. \qquad (4.20)$$

Fig. 4.8 Learning as error minimization for a two-dimensional example. a. The error function
is shown as a surface over the weight space. At a point (a, b) in weight space, the partial
derivatives are defined via tangents to the surface in the coordinate directions. The gradient is
the direction of steepest ascent. b. The error function is shown as a contour plot. The optimum
corresponds to the deepest trough in the error-surface. Numerical optimization searches this
trough by starting at a random location w1 , and calculating the local gradient, i.e. the steepest
upward direction. It then moves the current weight estimate by a certain step in the negative
gradient direction, i.e. downhill, and iterates. If a minimum is reached, the gradient evaluates
to zero and the procedure stops. Note that this approach can be caught by “local minima”,
since it cannot take into account distant troughs.

Figure 4.8a shows, for the surface point (a, b, E(a, b)), the tangent line in the sectional plane together with a unit vector in the w_1 direction. The height difference at the tip of this vector is the partial derivative. The same logic can be applied to the w_2 direction,

$$E_{w_2}(a,b) = \frac{\partial E}{\partial w_2}(a,b) := \lim_{h \to 0} \frac{E(a,b+h) - E(a,b)}{h}, \qquad (4.21)$$
and indeed for each individual dimension in the multi-dimensional case.
For a differentiable function of J variables, J such partial derivatives exist. They
are combined into a vector called the “gradient” of the function, gradE or ∇E (read
nabla E):
$$\operatorname{grad} E\,(a,b) = \Big(\frac{\partial E}{\partial w_1}(a,b),\; \frac{\partial E}{\partial w_2}(a,b)\Big). \qquad (4.22)$$
The gradient direction is the direction of steepest ascent on the surface; i.e. it is always orthogonal to the surface contour lines, pointing uphill. The length of the gradient is the local slope. The gradient is defined at every point in the domain of the
function, i.e. it establishes a vector field.
In gradient descent, minimization starts at a random initial position, marked as
w1 in Fig. 4.8b. The gradient is calculated at this position and the current weight
estimate is shifted in the opposite (downhill) direction. In standard numerical anal-


ysis, the optimal length of this shift is often determined by additional computations
which are generally not used in perceptron learning. In any case, the shift will lead
to a new estimate w2 of the weight vector with a reduced error. By iterating this pro-
cedure, a local minimum of the error surface can be reached. The process is stopped
if the gradient is close to the zero vector, i.e. if there is no remaining local slope.
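A minimal numerical sketch of this procedure (our own illustration; the toy error function, step size, and tolerance are assumptions, not taken from the text) estimates the partial derivatives by the difference quotient of Eq. 4.20 and iterates downhill steps:

```python
import numpy as np

def E(w):
    # toy error surface over (w1, w2) with minimum at (1, -0.5)
    return (w[0] - 1.0)**2 + 2.0 * (w[1] + 0.5)**2

def grad(w, h=1e-6):
    # finite-difference gradient: one difference quotient per coordinate (Eq. 4.20)
    g = np.zeros_like(w)
    for k in range(len(w)):
        dw = np.zeros_like(w)
        dw[k] = h
        g[k] = (E(w + dw) - E(w)) / h
    return g

w = np.array([3.0, 2.0])          # initial weight estimate w^1
step = 0.1
for _ in range(10000):
    g = grad(w)
    if np.linalg.norm(g) < 1e-4:  # stop when the local slope vanishes
        break
    w -= step * g                 # shift in the negative gradient direction
print(w)                          # close to the minimum (1, -0.5)
```

In standard numerical analysis the fixed step size would be replaced by a line search; the sketch keeps the constant step typical of perceptron learning.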

The δ -Rule
In neural networks, as in biological learning, training is usually sequential. That is
to say, one stimulus is presented and the weight vector is adjusted. Then another
stimulus is presented and again a small learning step is carried out. For one such
step, we can therefore treat s in Equation 4.19 as a constant and reduce E by an
appropriate shift of w. We therefore calculate the partial derivatives of E(w) with
respect to the weights, using the chain rule:
     
$$\frac{\partial E}{\partial w_k}(w_1,\dots,w_J) = 2\Big[f\Big(\sum_{j=1}^{J} s_j w_j\Big) - T(\mathbf{s})\Big]\;\frac{\partial}{\partial w_k} f\Big(\sum_{j=1}^{J} s_j w_j\Big). \qquad (4.23)$$

Next, we substitute from Equation 4.18 and apply the chain rule a second time:
 
$$\frac{\partial E}{\partial w_k}(w_1,\dots,w_J) = 2\,[e(\mathbf{s}) - T(\mathbf{s})]\; f'\Big(\sum_{j=1}^{J} s_j w_j\Big)\;\frac{\partial}{\partial w_k} \sum_{j=1}^{J} s_j w_j. \qquad (4.24)$$

As in the previous section, we write u := ∑ s j w j . The derivative of the sum is simply


the coefficient sk since all other terms in the sum do not depend on wk . We obtain:

$$\frac{\partial E}{\partial w_k}(w_1,\dots,w_J) = 2\,[e(\mathbf{s}) - T(\mathbf{s})]\; f'(u)\; s_k. \qquad (4.25)$$
Equation 4.25 gives the gradient direction in the error landscape E. By comparison with Equation 4.17, we see that that rule, which was based on a simple heuristic, does indeed move the weight vector in the negative gradient direction, i.e. downhill.
We now introduce the notation

$$\delta := (T(\mathbf{s}) - e(\mathbf{s}))\, f'(u) \qquad (4.26)$$

and obtain the so-called δ or Widrow–Hoff learning rule:

$$\Delta w_k = \lambda\,\delta\, s_k. \qquad (4.27)$$

In our iterative learning process, we start with some (maybe random) weight vector
w1 where the upper index denotes the current classification trial (Fig. 4.8b). We now
present a stimulus and obtain a response from which we calculate δ and apply the
learning rule. The new weight set will generate a reduced error if the last stimulus
is presented again. However, we now proceed to present the next stimulus and again
calculate δ . Strictly speaking, the optimization in each step therefore refers to a dif-
ferent error function, namely the one for the current stimulus. In perceptron learning,
this somewhat surprising approach works better than a more systematic approach,
where each weight vector would be tested with all stimuli before performing the
next descent step. This is at least partially due to the problem of overlearning which
occurs for fixed training sets of stimuli. Optimizing to changing stimuli introduces
an element of randomness that renders the eventual result more robust.
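A minimal sketch of this sequential training procedure (the sigmoid non-linearity, the random toy training set, and all parameter values are our own illustrative choices, not the book's):

```python
import numpy as np

def f(u):
    return 1.0 / (1.0 + np.exp(-u))    # sigmoidal non-linearity

def f_prime(u):
    return f(u) * (1.0 - f(u))         # its derivative

rng = np.random.default_rng(0)
S = rng.normal(size=(100, 5))          # 100 stimuli, J = 5 input lines
T = (S @ np.array([1., -1., .5, 0., 2.]) > 0).astype(float)  # teacher signal

w = 0.1 * rng.normal(size=5)           # initial weight vector
lam = 0.5                              # learning rate lambda

for epoch in range(200):
    for s, t in zip(S, T):             # one stimulus, one learning step
        u = s @ w                      # potential u = sum_j s_j w_j
        e = f(u)                       # output e = f(u)
        delta = (t - e) * f_prime(u)   # delta, Eq. 4.26
        w += lam * delta * s           # weight update, Eq. 4.27
```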
For sets of perceptrons with shared input, i.e. a layer of perceptrons as depicted
in Fig. 4.7, we can consider each output neuron individually and obtain

$$\Delta w_{ij} := \lambda\,\delta_i\, s_j = \lambda\,(T_i - e_i)\, s_j. \qquad (4.28)$$

Fig. 4.9 Multi-layer perceptron with K = 3 input units, J = 4 hidden units and I = 2 output units. Note that the input and hidden layers are identical to the two-layer perceptrons depicted in Fig. 4.7. s_k input unit activities, v_{jk} input-to-hidden weights, r_j = ∑_k v_{jk} s_k potentials of hidden units, h_j = f(r_j) hidden unit activities, w_{ij} hidden-to-output weights, u_i = ∑_j w_{ij} h_j potentials of output units, e_i = f(u_i) output unit activities, f static non-linearity.


Multi-layer Perceptrons: Back-Propagation


In multi-layer perceptrons, the response of an output neuron is determined not only
by its own weight vector but also by the weight vectors of all hidden neurons. Still,
the overall approach used so far can be applied: we can formulate the error as a
function of all weights and then perform a gradient descent in the resulting error landscape. With some calculation, it is straightforward to derive the following er-
ror minimization algorithm known as back-propagation learning (Rumelhart et al.,
1986). As before, we present a stimulus and compare the output with the teacher
signal. From this, the output-layer corrections δ_i can be obtained using Eq. 4.28
Fig. 4.10 Support vector machine. a. Between the two data clusters, the linear decision
boundary (heavy line) is chosen such that the margin between the two clusters is maxi-
mized. In the two-dimensional case depicted here, this is obtained by choosing three points
(two from one cluster and one from the other) defining two parallel lines between which no
data points fall (thin lines). The midline between these parallels is the decision boundary.
b. Solution of the XOR-problem using the kernel (s1 , s2 , s1 s2 ) (Eq. 4.30) and the weights
(w1 , w2 , w3 ) = (−1, −1, 2) in Eq. 4.31.

and the corresponding weight update Δw_{ij} = λ δ_i h_j is applied. In addition, hidden-layer


corrections ε j can be obtained from the output-layer corrections via

$$\varepsilon_j = f'(r_j) \sum_i \delta_i\, w_{ij}, \qquad (4.29)$$

from which the hidden layer weights are updated as Δv_{jk} = λ ε_j s_k. Eq. 4.29 is called
“back-propagation” because the summation is over the first index of the weights, as
if the correction signals would run backwards from the output units to the hidden
units and were summed there according to the hidden-to-output weights. Of course,
this is just an elegant mathematical result of the gradient calculation in the multi-
layer case, not an actual neurobiological process.
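The following sketch shows one such learning step for a small network with the layout of Fig. 4.9; the sizes, data, and learning rate are illustrative assumptions of ours:

```python
import numpy as np

def f(u):  return 1.0 / (1.0 + np.exp(-u))    # sigmoid
def fp(u): return f(u) * (1.0 - f(u))         # its derivative

rng = np.random.default_rng(1)
V = 0.5 * rng.normal(size=(4, 3))   # input-to-hidden weights v_jk
W = 0.5 * rng.normal(size=(2, 4))   # hidden-to-output weights w_ij
lam = 0.2                           # learning rate

s = np.array([0.0, 1.0, 0.5])       # stimulus
T = np.array([1.0, 0.0])            # teacher signal

r = V @ s; h = f(r)                 # hidden potentials and activities
u = W @ h; e = f(u)                 # output potentials and activities

delta = (T - e) * fp(u)             # output-layer corrections
eps = fp(r) * (W.T @ delta)         # hidden-layer corrections, Eq. 4.29
W += lam * np.outer(delta, h)       # Delta w_ij = lambda delta_i h_j
V += lam * np.outer(eps, s)         # Delta v_jk = lambda eps_j s_k
```

Note how the hidden-layer corrections use W.T, i.e. the summation over the first index of the weights that gives back-propagation its name.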

4.2.5 Support Vector Machines


The back-propagation and δ -rules are derived from the requirement to minimize
the overall classification error in a training sample. One general problem with this
approach is overlearning and generalization, i.e. the performance of the classifier
with novel data. In this situation, it often turns out that back-propagation has picked
up on a feature difference which happened to work in the training set, but leads to
false classifications in novel data.
One systematic approach to avoid this problem is to look for classification bound-
aries maximizing the decision margin, i.e. the distance in feature space between the
boundary and the closest data point. This approach will give the highest possible
robustness in generalization, since novel data will differ from the training set in sta-
tistical ways. Fig. 4.10a illustrates the idea in the two-dimensional case. The data-
points defining the boundary are the “support vectors” after which the approach has
been named. In the n-dimensional case, n + 1 such vectors will be needed to define a
(n − 1)-dimensional hyperplane and the margin. Algorithms for calculating the sup-
port vectors and decision boundaries from a set of data can be found in textbooks
of machine learning, e.g., Schölkopf & Smola (2002), Haykin (2008). These algo-
rithms can be relaxed to allow for a small number of data points to fall within the
margin, which is necessary if the two data clusters overlap.
So far, the support vector machine is yet another example of linear classification,
since the boundaries are again hyperplanes. Curved decision boundaries (non-linear
classification) can be achieved by the so-called kernel-approach in which the orig-
inal data vector, in the two-dimensional case s = (s1 , s2 ) , is replaced by a higher-
dimensional function Φ (s) such as

$$\Phi(\mathbf{s}) := (s_1,\, s_2,\, s_1 s_2)^\top. \qquad (4.30)$$

This embedding in a high-dimensional space is reminiscent of the multilayer-


perceptron approach, where the input stimulus is re-coded by a number of hidden
units. In the example given in Equation 4.30, the original data points are mapped to
a three-dimensional space. In practical applications, embeddings to higher dimen-
sional spaces are used. The idea is, that linear separation in a higher dimensional
space is generally more powerful than in lower-dimensional spaces. In any case,
the optimal linear decision boundary in the high dimensional space is determined
according to the maximal margin procedure sketched in Fig. 4.10a. It leads to a
hyperplane given by the equation

$$(\Phi(\mathbf{s}) \cdot \mathbf{w}) - \theta = 0, \qquad (4.31)$$

where w is the normal vector of the hyperplane and θ specifies its position. In the
example of Equation 4.30, Equation 4.31 can be rewritten as

w1 s1 + w2 s2 + w3 s1 s2 − θ = 0. (4.32)

Solving this equation for s_2 yields a functional expression for the curved decision surface in the original (s_1, s_2) feature space:

$$s_2 = \frac{\theta - w_1 s_1}{w_2 + w_3 s_1}. \qquad (4.33)$$
This equation describes a hyperbola with a pole at s1 = −w2 /w3 and the asymptote
s2 = −w1 /w3 approached as s1 goes to ±∞. This decision surface could for example
be used to solve the XOR-problem (see Fig. 4.10b). If other maps Φ are used, other
classes of decision functions can be obtained.
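As a check, the following snippet evaluates the decision function of Eq. 4.31 with the kernel of Eq. 4.30 and the weights of Fig. 4.10b on the four binary input patterns; the threshold θ = −0.5 is an assumption chosen here so that the sign of Eq. 4.31 separates the two XOR classes:

```python
import numpy as np

w = np.array([-1.0, -1.0, 2.0])    # weights (w1, w2, w3) from Fig. 4.10b
theta = -0.5                       # assumed threshold

def phi(s):                        # kernel embedding, Eq. 4.30
    return np.array([s[0], s[1], s[0] * s[1]])

for s in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    value = phi(np.array(s, dtype=float)) @ w - theta   # Eq. 4.31
    print(s, "XOR" if value < 0 else "not XOR")
```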

4.3 Associative Memory


Associative networks compute mappings between an input vector and an output
vector. This is a common task in the brain. For example, an input vector coding
some image recorded in the retina may be associated with a motor output vector to
the larynx and vocal tract, causing the utterance of the name of a person depicted in
this image. In the vestibulo-ocular reflex, input data obtained from the semicircular
canals may be associated with (i.e. transformed to) a motor command for stabilizing
posture. In technical applications, mappings from multi-electrode recordings of the
motor cortex to the joint angles of the arm have been learned and used to control
the movement of arm prostheses with neural commands. The important question for
neural network theory is to find mechanisms, by which such association rules can
be learned.
The distinction between classification and association is not completely clear-
cut, especially if “perceptrons” with multiple output neurons are considered as in
Figs. 4.7 and 4.9. In this section, we will mostly consider linear neurons, i.e. identify
potential and excitation, and study the correlations between the various inputs and
outputs.

4.3.1 Topology: The Feed-Forward Associator


Consider a network composed of two subsets of units (layers), called input and
output layer (Fig. 4.11). The activity vectors are denoted by s = (s1 , ..., sJ ) for
the input layer and e = (e1 , ..., eI ) for the output. Within each layer, there are no
lateral connections; however, each cell in the output layer receives input from each
cell in the input layer, described by a weight ci j . Note that these conventions differ
from our previous definition (Equation 4.9): cii is not the coupling of a unit with
itself, but rather the coupling of two units—one in the input layer and one in output
layer—which happen to have the same index number. For a linear activation transfer
function (i.e. without non-linearity), we have:
$$e_i = \sum_{j=1}^{J} c_{ij}\, s_j \qquad \text{or} \qquad \mathbf{e} = C\,\mathbf{s}. \qquad (4.34)$$

If (in agreement with Equation 4.9) we denote by W the connectivity matrix of the whole set of neurons, i.e. of (s_1, s_2, ..., s_J; e_1, e_2, ..., e_I)^⊤, the weight matrix of the combined set becomes
$$W = \begin{pmatrix} 0_{JJ} & 0_{JI} \\ C & 0_{II} \end{pmatrix}, \qquad (4.35)$$
where 0IJ is the I × J-matrix with all zero coefficients. W describes a complete
feed-forward connectivity. Note also that this topology is identical to the two-layer
perceptron with multiple outputs (Fig. 4.7) although the graphical representation in
Fig. 4.11 looks quite different.
4.3.2 Example: A 2 × 3 Associator


Consider the associator shown in Figure 4.11. Assume that we want to implement
in this associator the input-output pair (s^1 = (1, 0)^⊤, e^1 = (1, 0, 1)^⊤). As before, the
upper index marks a presentation number, or a presentation made at a certain time
step.
It is easy to see that this association is implemented by the weight matrix
$$C_1 = \begin{pmatrix} 1 & 0 \\ 0 & 0 \\ 1 & 0 \end{pmatrix}, \qquad (4.36)$$

since
$$\mathbf{e}^1 = C_1 \mathbf{s}^1 = \begin{pmatrix} 1 & 0 \\ 0 & 0 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} 1 \\ 0 \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix}. \qquad (4.37)$$
The weight matrix C_1 was found by setting to the value of 1 all weights with indices i, j for which s_j = 1 and e_i = 1. In our example, where all activities are either 0 or 1, this is equivalent to saying
$$c_{ij} = s_j\, e_i. \qquad (4.38)$$
The activities s j and ei are the pre- and postsynaptic activities with respect to the
connection ci j . The product of these activities is a standard component in formaliza-
tions with the Hebb learning rule mentioned above. It can be thought of as some kind
of “one-shot” learning where the weight is set and fixed after just one presentation
of stimulus and desired output. In matrix notation, we obtain
$$C := \begin{pmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \\ c_{31} & c_{32} \end{pmatrix} = \begin{pmatrix} e_1 \\ e_2 \\ e_3 \end{pmatrix} (s_1, s_2) = \mathbf{e}\,\mathbf{s}^\top. \qquad (4.39)$$


Fig. 4.11 A 2 × 3 associator. Each unit in the input layer (s_1, s_2) is connected to each neuron in the output layer (e_1, e_2, e_3) via a weight c_{ij}. By suitable choices of the connecting weights, the network can be made to store "associations", i.e. transform known input patterns into arbitrary output patterns.
The multiplication of a column vector with a row vector, treating them as I × 1 and 1 × J matrices, respectively, is known as the outer product2 of the two vectors.
Equation 4.39 is therefore called the outer product rule. It states that associations can
be stored by calculating the outer product of output and input vector and using it as
a connectivity matrix. We will give a formally correct account of this idea below.
Of course, storing just one association pair does not lead very far. Let us therefore
assume that we want to store a second pair, e.g., (s^2 = (0, 1)^⊤, e^2 = (0, 1, 1)^⊤). From
the outer product rule, we obtain
$$\mathbf{s}^2 = \begin{pmatrix} 0 \\ 1 \end{pmatrix} \;\rightarrow\; \mathbf{e}^2 = \begin{pmatrix} 0 \\ 1 \\ 1 \end{pmatrix} \;\Rightarrow\; C_2 = \mathbf{e}^2 (\mathbf{s}^2)^\top = \begin{pmatrix} 0 \\ 1 \\ 1 \end{pmatrix} (0, 1) = \begin{pmatrix} 0 & 0 \\ 0 & 1 \\ 0 & 1 \end{pmatrix}. \qquad (4.40)$$

It turns out that, at least in this example, the two connectivity matrices C 1 and C 2 can
simply be added up to obtain an associator that works for both pairs simultaneously.
$$C = C_1 + C_2 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{pmatrix}. \qquad (4.41)$$

Easy calculation proves the result: e^1 = C s^1 and e^2 = C s^2. We will see below that this is always true if the input vectors are orthogonal to each other.
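The whole example can be verified in a few lines (a sketch using numpy's outer product; note that the two input vectors are orthonormal, which is why the superposition of the two matrices works):

```python
import numpy as np

s1, e1 = np.array([1., 0.]), np.array([1., 0., 1.])
s2, e2 = np.array([0., 1.]), np.array([0., 1., 1.])

C = np.outer(e1, s1) + np.outer(e2, s2)  # C = C_1 + C_2, Eq. 4.41
print(C @ s1)                            # [1. 0. 1.] = e^1
print(C @ s2)                            # [0. 1. 1.] = e^2
```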

4.3.3 Associative Memory and Covariance Matrices


Let now s be a general stimulus vector, which we want to associate with an activ-
ity vector e on the output layer. We need to find a weight matrix C satisfying the
equation
e = Cs. (4.42)
Extrapolating from the example, we might expect that a matrix with this property
can be found by considering the outer product of input and output vector:
$$C = \mathbf{e}\cdot\mathbf{s}^\top = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_I \end{pmatrix} (s_1, s_2, \dots, s_J) = \begin{pmatrix} e_1 s_1 & e_1 s_2 & \dots & e_1 s_J \\ e_2 s_1 & e_2 s_2 & \dots & e_2 s_J \\ \vdots & \vdots & & \vdots \\ e_I s_1 & e_I s_2 & \dots & e_I s_J \end{pmatrix}. \qquad (4.43)$$

As the response of our network, we obtain:


$$C\mathbf{s} = (\mathbf{e}\,\mathbf{s}^\top)\,\mathbf{s} = \mathbf{e}\,(\mathbf{s}^\top\mathbf{s}) = \mathbf{e}\,\|\mathbf{s}\|^2. \qquad (4.44)$$

2 The outer product is also known as the tensor product. Its result is a matrix. Recall for comparison that the dot product discussed in Section 4.1.3, c = x^⊤y, evaluates to a number. In contrast to the outer product, the dot product is also called the inner product.
This is the desired output, up to a factor ‖s‖². We can incorporate this factor in the weight matrix by choosing the coefficients c_{ij} = e_i s_j ‖s‖^{-2}.
In general, an associative network should store not just one association, but many.
We denote (as before) the various input and output vectors by raised indices, i.e. s p
is the p-th input vector. Here, p is mnemonic for presentation. For each presentation
(or association pair), we can calculate the connectivity matrix by the outer product
rule. As in the example, we now simply add up these outer products and obtain:
$$C = \sum_q \mathbf{e}^q \cdot (\mathbf{s}^q)^\top = \begin{pmatrix} \sum_q e_1^q s_1^q & \sum_q e_1^q s_2^q & \dots & \sum_q e_1^q s_J^q \\ \sum_q e_2^q s_1^q & \sum_q e_2^q s_2^q & \dots & \sum_q e_2^q s_J^q \\ \vdots & \vdots & & \vdots \\ \sum_q e_I^q s_1^q & \sum_q e_I^q s_2^q & \dots & \sum_q e_I^q s_J^q \end{pmatrix}. \qquad (4.45)$$

This matrix is essentially the matrix of the covariances that can be computed
between all input units and all output units, up to their respective means (cf. Equa-
tion 4.16). Note that in Equation 4.16 the covariance was calculated over the com-
ponents of a stimulus vector, whereas here, the sum runs over the presentations.
Mathematically, this is no problem, but the interpretation is not the same.
C in Equation 4.45 is a "mixed" covariance matrix of e and s. Later we will also encounter the (standard) covariance matrix of s, (1/P) ∑_p s^p (s^p)^⊤.
In order to test whether the matrix C has the desired property of realizing the
complete set of associations, we apply it to the p-th input pattern:
 
$$C\mathbf{s}^p = \Big(\sum_q \mathbf{e}^q \cdot (\mathbf{s}^q)^\top\Big)\,\mathbf{s}^p = \sum_q \mathbf{e}^q\,\big((\mathbf{s}^q)^\top \mathbf{s}^p\big). \qquad (4.46)$$

The term ((s^q)^⊤ s^p) is the dot product of the input vectors for the p-th and q-th association pair. In order to produce the correct associations, this expression has to evaluate to zero if p ≠ q and to one if p = q. This is to say that the input vectors must be pairwise orthogonal and of unit length. Therefore, if the associator has J input lines, only up to J association pairs can be stored since no more than J pairwise orthogonal vectors can be placed in the J-dimensional feature space. For larger numbers P of association pairs in Eq. 4.45, mixtures of outputs will occur.

4.3.4 General Least Square Solution


If the number of association pairs exceeds the number of input lines, one may still
ask for the optimal weight set, reproducing the desired outputs as closely as possible.
This question can be answered in the following way.
Let P be the number of associations. We introduce the matrices


$$S := [\mathbf{s}^1; \mathbf{s}^2; \dots; \mathbf{s}^P] = \begin{pmatrix} s_1^1 & s_1^2 & \dots & s_1^P \\ s_2^1 & s_2^2 & \dots & s_2^P \\ \vdots & \vdots & & \vdots \\ s_J^1 & s_J^2 & \dots & s_J^P \end{pmatrix} \qquad (4.47)$$

and
$$E := [\mathbf{e}^1; \mathbf{e}^2; \dots; \mathbf{e}^P] = \begin{pmatrix} e_1^1 & e_1^2 & \dots & e_1^P \\ e_2^1 & e_2^2 & \dots & e_2^P \\ \vdots & \vdots & & \vdots \\ e_I^1 & e_I^2 & \dots & e_I^P \end{pmatrix}. \qquad (4.48)$$
These are simply obtained by writing the column vectors s^p and e^p one after the other. Instead of Eq. 4.42, we may now write the relation for all presentations p jointly as
$$E = CS, \qquad (4.49)$$
where E is an I × P, C an I × J, and S a J × P matrix.
If there are only J input patterns which are pairwise orthogonal and of unit length, the matrix S will be square and unitary, i.e. S^{-1} = S^⊤. We therefore obtain C simply from C = E S^⊤. In the general case, i.e. if the input vectors are not orthogonal, one can show that the best possible weight matrix for the desired associations is
$$C^* = E\, S^\top (S S^\top)^{-1}. \qquad (4.50)$$

This matrix minimizes the squared error between the desired outputs E and the actual outputs C^*S. The matrix S^⊤(S S^⊤)^{-1} is called the Moore–Penrose pseudoinverse of S. It also occurs in regression analysis, in the solution of general linear problems.
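A minimal sketch of Eq. 4.50 with random, purely illustrative data, for the case of more association pairs than input lines:

```python
import numpy as np

rng = np.random.default_rng(2)
S = rng.normal(size=(4, 10))    # J = 4 input lines, P = 10 presentations
E = rng.normal(size=(3, 10))    # I = 3 output units

C_star = E @ S.T @ np.linalg.inv(S @ S.T)  # C* = E S^T (S S^T)^(-1), Eq. 4.50
# equivalently: C_star = E @ np.linalg.pinv(S)
print(np.linalg.norm(C_star @ S - E))      # residual of the best linear fit
```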

4.3.5 Applications
In this section, we have discussed very general ideas of associative memory, which—
on the level of our presentation—amounts basically to linear regression. This idea is
simple, but powerful, especially if large numbers of neurons, i.e. high dimensional
feature vectors are considered. We briefly discuss three cases.

Memory
Associative networks store the associations of an input pattern, or stimulus, and
an output pattern, or recall. The memory is content-addressable in the sense that
it retrieves items not just by reference number, but by some sort of meaning. For
example, one could code words (strings) of n characters into n groups of 26 input
neurons (one for each letter of the alphabet), and code k × l-pixel images into the
activities of kl output neurons. In a training session, each image is associated with
a written name and the network learns the covariances between the input neuron
activities and the output neuron activities. If a string is presented at the input, the
associated image will thus form at the output.
Associative memory is also distributed in the sense that deleting a number of in-
ternal connections will not delete specific memory items, since the output is stored
in all covariances. Therefore, the result of partial deletion will be an unspecific de-
terioration of the output image, not the loss of individual images or memory items.
By the same token, misspellings of the input string can be corrected. Memories
with this property are also called “holographic”.
Both properties, content-addressability and distributedness, are highly plausible
for models of biological memories. For a classical reference, see Kohonen et al.
(1976).

Attractor Neural Networks


The simple associator of Fig. 4.11 can be turned into a network with complete con-
nectivity by feeding the output back to the “dendrites” in the upper part of the figure.
Since this feedback involves some time delay, it can be considered as an iteration of
an input being passed through the same connectivity matrix over and over again. In
the linear case, such “auto-associators” have been used to improve a degraded pat-
tern. In the non-linear case, so-called “attractors” will arise, i.e. special patterns of
activity onto one of which the global activity state will eventually converge. Since
the particular attractor reached depends on the input, attractor dynamics can be con-
sidered a mechanism of pattern classification (Hopfield 1982) .

Neuroprostheses
Robotic arms have been controlled by signals recorded from large numbers of neurons in the motor and parietal cortices of a monkey. Simultaneously with these recordings, movements of the monkey's arm are also recorded and encoded in terms of the angles and angular velocities occurring at the shoulder and elbow joints. The
“association” between the neural activities and the arm movement can be learned
along the lines discussed above; indeed, this amounts to a linear regression analysis
of the arm movement as a function of neural activity. Once the covariance matrix is
known, new neural activities can be used to predict the “intended” arm movement
and drive a robotic arm with this signal. The results show substantial agreement be-
tween the monkey’s arm movements and the movements of the robot arm (Wessberg
et al. 2000).

4.4 Self-organization and Competitive Learning


For a synapse with pre- and postsynaptic activities s j and ei , respectively, the Hebb
learning rule is usually formulated as

$$w_{ij}^{t+1} = w_{ij}^{t} + \lambda\, e_i s_j \qquad \text{or} \qquad W^{t+1} = W^{t} + \lambda\, \mathbf{e}\,\mathbf{s}^\top \qquad (4.51)$$
where λ ≥ 0 is again a learning rate and t and t + 1 are time steps. The left equation
applies to an individual synaptic weight while the right is formulated for the entire
weight matrix. In the unsupervised case considered here, ei is simply the result of
the network activity, i.e. e = f (Ws) where f is the static non-linearity introduced in
Equation 4.4. If we substitute this activation dynamics into Equation 4.51 we obtain
a difference equation for W depending on the input pattern:

$$\Delta W = \lambda\, \mathbf{e}\,\mathbf{s}^\top = \lambda\, f(W\mathbf{s})\,\mathbf{s}^\top. \qquad (4.52)$$

If, as a first approximation, the activation function is assumed linear, we can omit the function f and obtain ΔW = λ W s s^⊤. Since the sum of outer products s s^⊤ taken over time is the covariance matrix of the input set, we can expect that the weight matrix eventually learned by the system will reflect the statistical properties of the stimulus set. Indeed, in competitive learning, it can be shown that neural networks have the ability to form statistically optimal representations by means of self-organization.

Fig. 4.12 Geometric interpretation of Oja's learning rule, Eq. 4.54. At time t, the weight vector is w^t. Assume now that a stimulus s^t is delivered. The new weight vector w^{t+1} is obtained by adding a certain multiple of the stimulus vector to the old weight vector and normalizing the sum. The tip of the weight vector moves on a circle (due to the normalization) towards the side where the stimulus occurred. If the system reaches a steady state, the weight vector will be pointing to the center of the cloud of input vectors delivered to the system during training.

4.4.1 The Oja Learning Rule


One problem with the Hebb learning rule as formulated above is that the weights
will grow without limit if appropriate stimuli are presented. To avoid this, one may
postulate that the squared norm of the weight vectors, ∑_j w_{ij}², is kept constant, say 1, during
the entire learning process, so that if one weight is increased, all others will be
decreased in an unspecific way.
Here, we consider a version of this idea that leads to mathematically straight-
forward results. This rule was suggested by Oja (1982) for the input weights of
just one neuron without non-linearity. The “network” topology is that of a simple
linear neuron or two-layer perceptron where the weight matrix reduces to the weight
vector of this neuron. The situation is modeled by the activation function
Fig. 4.13 Principal component analysis (PCA). A sample of input vectors s^p for a network with N input units forms a cloud of dots in the N-dimensional feature space. In many applications, it is useful to recode these vectors in a new coordinate system whose first axis is the axis of longest elongation of the cloud of dots. The further axes are determined as being orthogonal to the already defined axes and having the largest elongation subject to the orthogonality constraint. Mathematically, these axes are computed as the eigenvectors of the covariance matrix of the data set. They are called the principal components of the data set.

$$e^t := \sum_{j=1}^{J} w_j\, s_j^t = \mathbf{w}^\top \mathbf{s}^t \qquad (4.53)$$

and a set of input vectors presented as a temporal sequence s^t := (s_1^t, ..., s_J^t)^⊤. We
now introduce the normalizing Hebbian rule (also known as Oja’s learning rule):

$$w_j^{t+1} = \frac{w_j^t + \lambda\, e^t s_j^t}{\sqrt{\sum_{j'=1}^{J} \big(w_{j'}^t + \lambda\, e^t s_{j'}^t\big)^2}}\,; \qquad \mathbf{w}^{t+1} = \frac{\mathbf{w}^t + \lambda\, e^t \mathbf{s}^t}{\|\mathbf{w}^t + \lambda\, e^t \mathbf{s}^t\|}. \qquad (4.54)$$
The denominator guarantees that ‖w‖ = 1 at all times.


The resulting development of the weight vector is illustrated in Fig. 4.12 for a
neuron with just two input lines. Due to the normalization, the total length of the
weight vector is constant; that is to say, during learning, all the weight vector can do
is move with its tip on a circle, or (hyper-)sphere. The changes of the weight vector,
Δ w = w t+1 − w t will therefore always be orthogonal to w, i.e. tangent to the sphere
of which w t is a radius.
We now use this property to reformulate the learning rule Equation 4.54, thereby
dropping the time argument. Without the normalization, Δ w consists only of the
Hebbian term es, multiplied with the learning rate λ . The normalization can be
approximated by subtracting a vector from λ es which lies in a plane with w and s
and makes the total weight change orthogonal to w. It is easy to prove that these
conditions are satisfied by
Δ w = λ e[s − ew] (4.55)
since ‖w‖ = 1. Equation 4.55 can be formally derived from Equation 4.54 by a
Taylor expansion of Δ w about λ = 0. Alternatively, Δ w in Equation 4.55 can be
considered the result of a Gram-Schmidt orthogonalization of w and s. For our dis-
cussion, it suffices to take it as an approximation of Equation 4.54.
4.4 Self-organization and Competitive Learning 109

Next, we substitute from Equation 4.53, keeping in mind that w^⊤s = s^⊤w due to the commutativity of the dot product. This yields
$$\Delta \mathbf{w} = \lambda\,\big[\,\mathbf{s}\,\underbrace{(\mathbf{s}^\top \mathbf{w})}_{e} - \underbrace{(\mathbf{s}^\top \mathbf{w})}_{e}\,\underbrace{(\mathbf{s}^\top \mathbf{w})}_{e}\,\mathbf{w}\,\big]. \qquad (4.56)$$

Due to the associativity of matrix multiplication, we may bracket the outer products s s^⊤. In the temporal average, i.e. for many stimulus presentations, these outer products become the covariance matrix of the training set, C := (1/t_max) ∑_t s^t (s^t)^⊤. We obtain:
$$\Delta \mathbf{w} = \lambda\,\big(C\mathbf{w} - (\mathbf{w}^\top C\,\mathbf{w})\,\mathbf{w}\big). \qquad (4.57)$$
The weight vector will converge when Δw = 0. For the resulting weight set, we obtain
$$C\mathbf{w} = (\mathbf{w}^\top C\,\mathbf{w})\,\mathbf{w}. \qquad (4.58)$$
The term in parentheses, (w^⊤Cw), is a number, let's call it μ. Equation 4.58 therefore becomes the eigenvalue3 equation Cw = μw, i.e. the weight vector will converge to an eigenvector of C. Moreover, Oja (1982) has shown that w will converge to the eigenvector with the largest eigenvalue, i.e. the perceptron will evaluate the loading (coefficient) of the first principal component of the data set.
This result means that by the normalizing Hebb rule, a perceptron will automatically adjust its weight vector to the first principal component of the data set. If we think of the set of input data as a cloud of dots in feature space, the first principal component is the longest axis of this cloud (Figure 4.13). This result was to be expected from our graphical consideration in Fig. 4.12 since the weight vector is attracted by all stimulus vectors in the cloud and thus ends up in the center of that cloud.
Since the activation function of the perceptron performs a projection on the weight
vector (Section 4.2.2), this result therefore means that the Oja rule automatically
finds the axis representing the largest possible variability of the data set in just one
number. Thus, the information conveyed is maximized.
Note that learning only occurs if the output e is positive. This means that the
perceptron needs to start with a weight vector not too distant from the stimulus.
Also, if the data set consists of two distinct and sufficiently distant clouds of dots,
the weight vector will pick up on the one which is closer to its start position.
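The convergence to the first principal component can be illustrated with a short simulation (a sketch; the elongated Gaussian input cloud and the learning parameters are our assumptions, and the result is determined only up to sign):

```python
import numpy as np

rng = np.random.default_rng(3)
A = np.array([[2.0, 0.3],            # covariance matrix with one long axis
              [0.3, 0.5]])
stimuli = rng.multivariate_normal([0, 0], A, size=5000)

w = rng.normal(size=2)
w /= np.linalg.norm(w)               # start on the unit circle
lam = 0.01

for s in stimuli:
    e = w @ s                        # output e = w^T s, Eq. 4.53
    w = w + lam * e * s              # Hebbian step
    w /= np.linalg.norm(w)           # normalization, Eq. 4.54

evals, evecs = np.linalg.eigh(A)
print(w, evecs[:, -1])               # w matches the top eigenvector (up to sign)
```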

4.4.2 Self-organizing Feature Map (Kohonen4 Map)


There are various ways to generalize these ideas to networks with many neurons.
Simply adding another output neuron does not help, because in this case both weight
vectors might converge to the same direction in feature space, which is clearly not
an optimal representation.
3 The notion of an eigenvalue in matrix theory is completely analogous to its usage in Fourier
theory, Section 3.2. The matrix is considered a mapping between vectors. For a vector and a
number satisfying Mv = λv, v is an eigenvector and λ is the corresponding eigenvalue.
4 Teuvo Kohonen (born 1934). Finnish mathematician and computer scientist
Let us consider a linear hetero-associator (Fig. 4.14a), i.e. a feed-forward pro-


jection of an input layer with stimuli s = (s1 , s2 , ..., sJ ) to an output layer e =
(e1 , e2 , ..., eI ) . Inside the output layer, there are no lateral connections, but the out-
put neurons are arranged in a grid, such that a distance between any two output
neurons is defined. This distance function is a novel feature which was not con-
sidered in previously discussed neural network models. It may be used to define a
neighborhood Nk of each unit k as the set of units whose distance to unit k is less
than a certain threshold. As in Section 4.3.1, the activation dynamics is given by an
input-output weight matrix C = {ci j }i≤I, j≤J :
$$e_i = \sum_{j=1}^{J} c_{ij}\, s_j =: \mathbf{c}_i^\top \mathbf{s}. \qquad (4.59)$$

Here, ci = (ci1 , ci2 , ..., ciJ ) is again the vector of all input weights of unit i, i.e. the
receptive field of the unit. The so-called Kohonen map uses the following competi-
tive weight dynamics. Let
$$k := \operatorname*{argmax}_{1 \le i \le I} \{e_i\} \qquad (4.60)$$

be the index number of the output unit generating the strongest excitation after a given stimulus presentation. This unit will be the one whose weight vector most closely resembles the input vector; it is usually called the "winner" unit. After each stimulus presentation, the winner unit is selected and subjected to a learning step which will make it react even more strongly when the same stimulus occurs the next time. Learning is also applied to the units in the neighborhood of the winner, i.e. these units as well will react more strongly if the last stimulus re-occurs:

$$\text{for all } i \in N_k: \quad \mathbf{c}_i^{t+1} := \frac{\mathbf{c}_i^t + \lambda^t \mathbf{s}^t}{\|\mathbf{c}_i^t + \lambda^t \mathbf{s}^t\|}\,; \qquad \lambda^t := \frac{\lambda_o}{t} \in \mathbb{R}. \qquad (4.61)$$

The learning rule itself is again a normalizing Hebb rule. Thus, there are two points
where competition occurs, first in the selection of a “winner neuron” and its neigh-
borhood which are eligible for learning, and second among the various input weights
of these neurons since weight increase at one input line results in unspecific reduc-
tion of all other weights. The time-dependence of the learning rate is introduced to
enforce convergence of the procedure. This type of enforced convergence is also
known as “annealing”.
Fig. 4.14b,c illustrate the learning process in a self-organizing feature map. The
coordinate system plotted represents the unit-hypersphere on which the weight vectors c_i move. In the three-dimensional case, the coordinates σ_1, σ_2 on this hypersphere can be thought of as elevation and azimuth. The shaded area shows the region in feature space where stimuli are likely to occur. The network starts with a random initialization (Fig. 4.14b) where the (σ_1, σ_2)-coordinates of each output unit are independent of the unit's respective position in the neighborhood grid; the map therefore looks scrambled. During learning, the weight vectors move into the gray
area because at each learning step, the winner and its neighbors are attracted by a
Fig. 4.14 Self-organizing feature map. a. Network topology with input layer (s_j), input weights (c_{ij}), map layer (e_i) and map layer adjacencies, in this example with the topology of a 2 × 3 grid. b. Unit-hypersphere in the input space, with coordinates σ_1, σ_2. For each map unit e_i, its input weight vector is marked as c_i. The heavy black lines show the map layer adjacencies. The weights are initialized at random, leading to an unordered arrangement of weight vectors. The gray area marks the region in feature space where input probability is high. c. After learning, the weight vector grid has "crawled" into the interesting area of feature space. It also has "unfolded" such that adjacent units now have similar weight vectors (i.e., receptive fields).

stimulus vector from within that range (Fig. 4.14c). At the same time, the map un-
folds and ends up in a “neighborhood-preserving” (or continuous) fashion where
adjacent output neurons have similar weight vectors. This continuity is a result of
the shared learning in the neighborhoods. Input patterns with a high frequency of occurrence will be represented by more units (larger grid regions) than rare patterns.
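A minimal Kohonen-map sketch along the lines of Eqs. 4.60 and 4.61 (a one-dimensional chain of map units; grid size, neighborhood radius, stimulus distribution, and rate schedule are our illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
I, J = 20, 3                          # 20 map units, 3 input lines
C = rng.normal(size=(I, J))           # random initial weights
C /= np.linalg.norm(C, axis=1, keepdims=True)

lam0, radius = 0.5, 2                 # neighborhood N_k: |i - k| <= radius
for t in range(1, 5001):
    s = rng.normal([1.0, 1.0, 0.2], 0.2)  # stimulus from a restricted region
    s /= np.linalg.norm(s)
    k = int(np.argmax(C @ s))         # winner unit, Eq. 4.60
    lam = lam0 / t                    # decreasing learning rate
    for i in range(max(0, k - radius), min(I, k + radius + 1)):
        C[i] += lam * s               # normalizing Hebb step, Eq. 4.61
        C[i] /= np.linalg.norm(C[i])
```

After training, adjacent rows of C hold similar weight vectors, i.e. the map has unfolded in a neighborhood-preserving way.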
Kohonen maps or similar forms of competitive learning underlie a large class of models for the self-organization of representations, for example for orientation columns in the visual cortex or for representational plasticity in the somatosensory cortex. For examples, see von der Malsburg (1973) or Antolik & Bednar (2011).

4.5 Suggested Reading

Books
Arbib, M. (ed.), (2002). The Handbook of Brain Theory and Neural Networks. 2nd
edition, Cambridge, MA: The MIT Press
Haykin, S. (2008). Neural Networks and Learning Machines. 3rd edition, Upper
Saddle River, NJ: Pearson Prentice Hall.
Minsky, M. L. and Papert, S. A. (1988). Perceptrons, expanded edition. Cambridge,
MA: The MIT Press.
Shepherd, G., and Grillner, S. (eds.), (2010) Handbook of Brain Microcircuits. Ox-
ford University Press
Original Papers
Antolik, J., Bednar, J.A. (2011) Development of maps of simple and complex cells in the primary visual cortex. Frontiers in Computational Neuroscience 5:17, 1 – 19.
Recent example of a self-organization model for cortical orientation selectivity,
taking into account differences between cortical layers and cell-types.
Buchsbaum, G., Gottschalk, A. (1983) Trichromacy, opponent colours coding and
optimum colour information transmission in the retina. Proceedings of the
Royal Society (London) B 220:89 – 113.
Shows that color opponency can be interpreted as a recoding of the cone re-
ceptor signals via principal component analysis. One of the most compelling
examples of an application of this theory in neuroscience.
Carreira-Perpiñán, A. Á., Lister, R. J., Goodhill, G. J. (2005) A computational model
for the development of multiple maps in primary visual cortex. Cerebral Cortex
15:1222 – 1233
This paper uses a slightly different approach to self-organization to model the
interlaced structure of cortical maps for various neuronal specificities. The
results are in good agreement with neurophysiological data.
von der Malsburg, C. (1973). Self–organization of orientation sensitive cells in the
striate cortex. Kybernetik, 14:85 – 100.
One of the first papers demonstrating that the formation of cortical orientation
columns might be a result of competitive learning.
Serre, T., Wolf, L., Bileschi, S., Riesenhuber, M., Poggio, T. (2007) Robust object
recognition with cortex-like mechanisms. IEEE Transactions on Pattern Analy-
sis and Machine Intelligence 29:411 – 426.
Modern classifier model based on the multi-layer perceptron. The layers are
interpreted as areas of the visual and temporal cortices. Pattern classification
abilities are tested with real images.
Wessberg, J., Stambaugh, C. R., Kralik, J. D., Beck, P. D., Laubach, M., Chapin, J. K., Kim, J., Biggs, S. J., Srinivasan, M. A., Nicolelis, M. A. L. (2000) Real-time prediction of hand trajectory by ensembles of cortical neurons in primates. Nature 408:361 – 365.
Large numbers of neurons are recorded from the cortex of a monkey and the
activities are correlated to joint angles and joint velocities of the monkey’s
forearm. By statistical methods akin to those described in this chapter, arm
movements can be predicted from the neural activities. The quality of these
predictions is assessed by using them to control a robot arm and compare the
movements of robot and monkey arm.
Chapter 5
Coding and Representation

Abstract. Neuronal activity does not, in general, in itself contain information about the stimulus. Spikes elicited by different stimuli look much the same, but generally occur in different cells. The information is contained in the specificity, or tuning
curve of the cell generating the spike. This may be considered a modern account of
the notion of the “specific sense energies” formulated by Johannes Müller1 already
in 1826. In effect, any stimulation of the eye (or visual pathways) is perceived as
visual and any stimulation of the ear (or auditory pathways) as auditory. The in-
formation encoded by each neuron is described by its tuning curve. Often, neurons
are tuned simultaneously to different parameters such as position in visual space,
edge orientation, and color, albeit to various extents (i.e., with sharper or coarser
tuning curves). Tuning curves of different neurons overlap, leading to population
coding where each stimulus is represented by the activities of a group, or popula-
tion, of neurons. The first part of this chapter explores the consequences of popula-
tion coding for neural information processing. In the second section, we study the
fact that neighboring neurons in the cortical surface tend to have similar receptive
fields and tuning curves. This similarity is defined for a combination of many pa-
rameters including position in the visual field as well as the well-known “perceptual
dimensions” orientation, spatial frequency, color, motion, and depth. It is particu-
larly evident for the tuning to visual field position, i.e. in retinotopic mapping of the
visual field onto the cortical surface in the centimeter range.

5.1 Population Code


5.1.1 Types of Neural Codes
Action potentials do not in their shape encode information about the stimulus.
Rather, neural coding is based on the temporal pattern of neuronal firing which,
for a single neuron, is represented as a spike train. As explained in Section 2.1.5,
we may derive from the spike train a time-dependent spike rate which we identify
1 Johannes Peter Müller (1801 – 1858). German physiologist.

with our excitation variable e(t). In this section, we assume that spike rate carries
the major part of neural information, as is generally also assumed in single-cell
neurophysiology. A related variable is the local field potential measured by low-
resistance extracellular electrodes which approximately is a local spatial average of
spike rate. In addition to spike rates and local field potentials, higher-order patterns in spike trains have been studied, which are revealed by auto- and cross-correlation (cf. Equation 2.57) or other techniques from time series analysis.
Consider a continuous variable taking values in the interval (0, 1). In principle,
there are at least three ways of coding such variables in neural activity, or spike rate:
1. Intensity code: The activity of the neuron is proportional to (or a monotonic
function of) the coded parameter. Image contrast is coded in this way, leading to
stronger activities if contrast is increased. One problem of this coding scheme is
that the dynamic range of the neuron (0 to < 500 spikes per second) determines
the resolution of the code.
2. Channel coding or labeled line code without overlap (Fig. 5.2a): The set of all possible stimulus values—in our example the interval (0, 1)—is partitioned into n parts, each of length 1/n. For each part i, a detector neuron exists which is active if and only if the stimulus falls within the interval ((i−1)/n, i/n). The tuning curve of each detector neuron is one on this interval and zero elsewhere.
3. Population code or labeled line code with overlap (Fig. 5.2b): Each stimulus is
coded by a set or population of neurons each of which is tuned to a section of
the coded stimulus. The tuning curves may overlap. This is the standard way of
neural coding.
Population codes need more than one neuron to encode a parameter value, whereas
in simple channel coding, one neuron may suffice. Despite this apparent disadvan-
tage, it can be shown that population codes are superior to non-overlapping channels
in many respects. We will now turn to the question of information content.

5.1.2 Information Content of Population Codes


In information theory (e.g. Cover and Thomas 1991), the information of a mes-
sage is defined as the average number of binary questions (i.e. questions that can
be answered either “yes” or “no”) needed to reconstruct that message. The unit of
information, i.e. the number of binary questions needed, is the “bit”. For exam-
ple, if the message is a string of zeros and ones, both numbers occurring equally
likely, then each number contains one bit of information. In this case, the question
to ask is “is the symbol a one” (or “is the symbol a zero”). Now consider the case
that the probability of a one is only q1 = 0.25 while the probability of a zero is
q0 = 0.75 (see also Fig. 5.1). In this case, one might start by asking “are the next
two digits 00?” The probability that this is true is p1 = 0.75² = 0.5625. If the answer is “no”, the next question could be “are the next two digits 01?”. The probability that this is true is p2 = 0.25 × 0.75 = 0.1875. If the answer is “no”, the third question would be “are the next two digits 10?”. Since any answer to the third question will
make the result clear, the probability of being done after exactly three questions is

Fig. 5.1 Questioning scheme for recovering a sequence of the characters “0” and “1” appear-
ing with probability 3/4 and 1/4 respectively. i, number of questions asked; pi , probability of
finding the answer after the i-th question. Each cycle recovers two digits by asking on average
27/16 = 1.6875 questions, i.e. roughly 0.84 binary questions per digit.

p3 = 0.25 × 0.75 + 0.25 × 0.25 = 0.25. With this question answered, we have re-
constructed the next two digits. The average number of questions to be asked in this
example is

$$I = p_1 + 2p_2 + 3p_3 \qquad (5.1)$$
$$= \frac{9}{16} + 2\cdot\frac{3}{16} + 3\cdot\frac{4}{16} = \frac{27}{16} = 1.6875. \qquad (5.2)$$
Per digit, this message therefore contains only 1.688/2 = 0.844 bits.
The optimal way of asking questions is such that each question comes out “yes”
or “no” about equally likely. By asking n such questions, one arrives at alternatives
with probability 0.5n . Put differently, if pi is the probability of a given alternative,
the number of questions asked to get its outcome in the optimal case is log_{0.5} p_i = −log_2 p_i. The average number of questions asked in an optimal scheme is therefore
$$H := -\sum_{i=1}^{n} p_i \log_2 p_i. \qquad (5.3)$$

H is called the Shannon2 entropy, or information content, of the distribution


(p_1, ..., p_n). It is assumed that the digits of the message occur statistically independently of each other.
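For the 0/1 example above, Eq. 5.3 can be evaluated directly (a minimal sketch):

```python
import numpy as np

def entropy(p):
    # Shannon entropy of a probability distribution, Eq. 5.3
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log2(p))

print(entropy([0.75, 0.25]))   # about 0.811 bits per digit
# The questioning scheme of Fig. 5.1 needs about 0.844 questions per
# digit, slightly above this theoretical lower bound.
```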
We may now apply these ideas to different neural coding schemes. We will as-
sume that the stimulus parameter q is uniformly distributed on the interval (0, 1).
Consider first the labeled line code without overlap and n equally spaced channels
(Fig. 5.2a). Each channel has the characteristic, or tuning curve

2 Claude E. Shannon (1916 – 2001). United States mathematician and engineer.



Fig. 5.2 Information transmission in channel-coded systems without overlap (a.) and with
overlap (b.) With overlap, more information can be transmitted. The vertically printed vectors
are the channel activities (ρ_1(q), ρ_2(q), ρ_3(q), ρ_4(q)) for each q-interval. For further explanation see text.


$$\rho_i(q) = \begin{cases} 1 & \text{if } \frac{i-1}{n} < q \le \frac{i}{n} \\ 0 & \text{otherwise} \end{cases} \qquad (5.4)$$

By coding the signal in this scheme, it is digitized to steps of 1/n, each encoded by
one of the activity patterns (1, 0, .., 0), (0, 1, ..., 0), ..., (0, 0, ..., 1). Smaller differences
cannot be resolved.
The probability of each of these n activity patterns to occur is 1/n, since we
assumed that the parameter q is equally distributed in the interval (0, 1). With n
equally likely activity patterns, we may calculate the entropy
$$H = -\sum_{i=1}^{n} \frac{1}{n} \log_2 \frac{1}{n} = \log_2 n. \qquad (5.5)$$

Information content increases with the number of channels, but only very slowly;
information content per channel decreases as the number of channels increases.
Consider now a coding scheme with overlapping channels as shown in Fig. 5.2b.
The width of each tuning curve is 1/2, and the tuning curves of the n channels are
shifted by 1/(2n):
$$\rho_i(q) = \begin{cases} 1 & \text{if } \frac{i-1}{2n} < q < \frac{i-1}{2n} + \frac{1}{2} \\ 0 & \text{otherwise} \end{cases} \qquad (5.6)$$
As can be seen from Figure 5.2b, there are 2n different activity distributions (or
code words) in this case. Therefore, the entropy is:
$$H = -\sum_{i=1}^{2n} \frac{1}{2n} \log_2 \frac{1}{2n} = 1 + \log_2 n. \qquad (5.7)$$

As compared to the case without overlap, information content is increased by one


bit. Put differently, carrying the same information in the interval code would require
twice as many neurons. Further increase can be obtained by allowing for channels
with graded activities. For example, the three channels in color vision (short, middle,
and long wave-length cones) allow one to distinguish between more than a million col-
ors. In contrast, graded activities do not add any information in the non-overlapping
case.
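The count of distinct code words behind Eqs. 5.5 and 5.7 can be checked numerically (a sketch for n = 4; the fine sampling grid is merely an implementation convenience of ours):

```python
import numpy as np

n = 4
qs = np.linspace(0.001, 0.999, 10001)    # fine sampling of the interval (0, 1)

# binary activity pattern of the n channels for each q, Eq. 5.4 / Eq. 5.6
without = {tuple(bool((i - 1) / n < q <= i / n)
                 for i in range(1, n + 1)) for q in qs}
with_ov = {tuple(bool((i - 1) / (2 * n) < q < (i - 1) / (2 * n) + 0.5)
                 for i in range(1, n + 1)) for q in qs}

print(len(without), np.log2(len(without)))  # 4 patterns -> 2.0 bits
print(len(with_ov), np.log2(len(with_ov)))  # 8 patterns -> 3.0 bits
```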

5.1.3 Reading a Population Code: The Center-of-Gravity Estimator
For the nervous system, having the information about a stimulus represented in an
activity pattern over a population of neurons is perfectly okay. For the experimenter,
however, it would be helpful to recover the original signal from the population ac-
tivity, in order to know what is actually represented in a given activity pattern.
Assume we are encoding a parameter p, where p can be anything, from the fre-
quency of a light wave to the retinal position of a stimulus, or, indeed, the direction
of a planned arm movement. The parameter is represented by a population of n neu-
rons, each with a preferred parameter value p̂i (read: p hat sub i). The preferred
parameter value is the one for which the neuron’s tuning curve takes its maximum.
In color vision, these are the peak positions of the quantum catch functions of the
cone receptors. The preferred stimulus values must be known to the experimenter in
advance, e.g., from an independent measurement of tuning curves. We now present
a particular stimulus value, say q, and take the resulting activities of each neuron,
ei (q), as a “vote” for the neuron’s preferred stimulus. The votes are weighted with
activity strength and summed together. Mathematically, we can describe the proce-
dure by
$$q^* = \frac{\sum_i \hat{p}_i\, e_i(q)}{\sum_i e_i(q)}, \qquad (5.8)$$
where q^* is the estimate for q. If only two neurons are considered, with p̂_{1,2} = ±1,
Equation 5.8 describes a balance with two weights corresponding to the activities
of the neurons and q∗ marks the point along the arm of the balance where it needs
to be supported in order not to tilt to either side. The right side of Equation 5.8 is
therefore called the “center of gravity”-estimator.
Of course, it would be nice to prove in Equation 5.8 that the estimate q∗ is indeed
equal to the original parameter value q. This, however, depends on the detailed shape
of the tuning curves ρi (q); for example, with cosinusoidal tuning curves, the center-
of-gravity estimator is generally biased. The question of the exactness of the center-
of-gravity estimator can be considered as an instance of the general problem of
function approximation with the tuning curves as “basis functions”. We will not
elaborate on this, but study a simple example illustrated in Fig. 5.3.

Fig. 5.3 a. An array of equidistant triangular tuning curves (Equation 5.9). Also shown is
a stimulus value q which excites all four channels according to their tuning curves. b. The
activities generated in these four channels. Their center of gravity equals the original stimulus.

Suppose a population of neurons with triangular tuning curves ρi (p) centered at


the preferred stimulus p̂i ,

$$\rho_i(p) = \begin{cases} 1 - \tfrac{1}{2}\,|p - \hat{p}_i| & \text{for } |p - \hat{p}_i| < 2 \\ 0 & \text{otherwise} \end{cases}. \qquad (5.9)$$

Fig. 5.3a shows a set of four such tuning curves where the preferred stimuli of each
neuron have been identified with the neuron’s number, i.e. p̂i = i.
Now, consider a stimulus with parameter value q; in Figure 5.3a it appears be-
tween the preferred stimuli of neurons 2 and 3. From Equation 5.9 it is easy to
calculate the activities in the four channels. They are shown in Figure 5.3b as gray
columns. As explained before, we consider these columns to be loads placed on the
beam of a balance. Since we now have more than two channels, each load is placed
at the position corresponding to the channel’s preferred stimulus. The definition of
the center-of-gravity as support-point of the beam remains the same. In the case
of triangular tuning curves, it is easy to show that the center-of-gravity estimator
actually recovers the encoded parameter value, q = q∗ in Equation 5.8.
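A short numerical check of this result (a sketch with the tuning curves of Eq. 5.9 and preferred values p̂_i = i, as in Fig. 5.3):

```python
import numpy as np

p_hat = np.array([1.0, 2.0, 3.0, 4.0])   # preferred stimuli of the channels

def rho(p):
    # triangular tuning curves of Eq. 5.9
    d = np.abs(p - p_hat)
    return np.where(d < 2, 1 - 0.5 * d, 0.0)

q = 2.4                                  # stimulus between neurons 2 and 3
e = rho(q)                               # channel activities
q_star = np.sum(p_hat * e) / np.sum(e)   # center-of-gravity estimate, Eq. 5.8
print(q_star)                            # -> 2.4; the estimator is exact here
```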
If the equality q = q∗ does not strictly hold, the center-of-gravity estimator is
still useful as an approximation; for an interesting example, see Kay et al. (2008).
In addition, more sophisticated methods of estimation theory have been applied to
the problem, most notably the idea of maximum likelihood estimation. It interprets
tuning curves as conditional probabilities for neuronal firing given that a particular
stimulus was presented. For a recorded pattern of population activity, one can then
calculate for each possible stimulus the probability of the recorded pattern to occur.
This probability is called the likelihood3 of the stimulus. Finally, one picks the one
stimulus generating the largest likelihood.

³ Note that the likelihoods of all stimuli need not sum to unity. This is why the term “likelihood”
is introduced, to distinguish the quantity from a true probability.
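As a sketch of this idea (our own illustration, with assumed Gaussian tuning curves and Poisson spike statistics; neither is prescribed by the argument above), a maximum likelihood decoder can be written in a few lines:

```python
import numpy as np

rng = np.random.default_rng(1)
p_hat = np.linspace(-2.0, 2.0, 9)             # preferred stimuli of 9 channels

def mean_rate(q):
    """Assumed Gaussian tuning curves: mean spike counts for stimulus q."""
    return 20.0 * np.exp(-0.5 * ((q - p_hat) / 0.8) ** 2) + 0.5

q_true = 0.37
counts = rng.poisson(mean_rate(q_true))       # one observed population response

# Poisson log-likelihood of the observed counts for each candidate stimulus
candidates = np.linspace(-2.0, 2.0, 401)
loglik = [np.sum(counts * np.log(mean_rate(q)) - mean_rate(q))
          for q in candidates]
print(q_true, candidates[np.argmax(loglik)])  # true and decoded stimulus
```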
Fig. 5.4 Hyperacuity (or sub-pixel resolution) in a population code for visual position. a. The
arrows mark visual positions for two ideal point lights. By the imaging properties of the eye,
they are washed out into two blurring disks on the retina (shown as Gaussians). The width
of blurring under ideal conditions is about 0.5 minutes of arc. These patterns excite the cone
photoreceptors, which in the fovea have a diameter of again 0.5 minutes of arc. Each blurring
disk will therefore excite a small population of adjacent cone receptors. The activity patterns
over these groups differ even if the light positions are less than 0.5 minutes of arc apart.
b. Test patterns for hyperacuity: left: vernier; middle: arc; right: row of dots. Well-trained
subjects correctly judge offset directions for offsets below 10 seconds of arc, i.e. less than
one third of the cone diameter.

5.1.4 Examples and Further Properties


Hyperacuity (Sub-Pixel Resolution)
If visual position is considered as a stimulus parameter, the receptive field functions
studied in Chapter 2 can be identified with the tuning curves for the parameter
“space”. The perception of visual position from overlapping receptive fields is
thus an example of population coding, since the position of a point stimulus is en-
coded not by the activity of just one neuron, but by the population activity of all
neurons whose receptive field includes the position in question.
Fig. 5.4 shows population coding of visual position already on the level of retinal
cone receptors. The set of all visual locations from which a given cone receptor can
be stimulated is determined by the width of the cone receptor (about 30 seconds of
arc in terms of visual angle or 1.7 μm in terms of retinal distance) and the width
of the blurring disk of a light beam arriving at the cornea. For ideal imaging condi-
tions (iris light adapted, exact accommodation of lens), this is again in the order of
30 seconds of arc. Fig. 5.4a shows this situation for two locations less than 30 sec-
onds of arc apart. In either case, three adjacent cone receptors will be activated, but
the amount of activation differs between the two locations. The population activ-
ity, or the center-of-gravity estimator calculated from it, maintains the information
about the separation between the two stimuli (cf. Poggio et al. 1992). In a labeled
line code without overlap, stimulus locations can only be told apart if they exceed
two cone widths.
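A numerical sketch of Fig. 5.4a (our own; the blur width σ = 15 arcsec is an assumption roughly consistent with the 30 arcsec figure above) shows that the center-of-gravity estimate separates two lights whose offset is only a third of the cone spacing:

```python
import numpy as np

cones = np.arange(-5, 6) * 30.0               # cone positions, 30 arcsec apart
sigma = 15.0                                  # assumed width of the blur disk

def cone_activity(x0):
    """Gaussian blurring disk centered at x0, sampled by the cone array."""
    return np.exp(-0.5 * ((cones - x0) / sigma) ** 2)

def cog(e):
    """Center-of-gravity estimate of position, Equation 5.8."""
    return np.sum(cones * e) / np.sum(e)

# two point lights 10 arcsec apart, one third of the cone spacing
print(cog(cone_activity(0.0)), cog(cone_activity(10.0)))   # approx. 0 and 9
```

With such coarse sampling the estimate is slightly biased, but the two positions remain clearly distinguishable even though their separation is far below the cone spacing.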
Fig. 5.4b shows some patterns which may be used to measure visual resolution.
The versions of the patterns depicted, or their mirror images, are shown to the test
subject. Then the subject is asked whether the upper line is to the left or the right
of the lower one (example at the left); in which direction the arc is bent (middle
example), or whether the middle of the three points is to the left or the right of the
line connecting the two other points (right example). All of these experiments reveal
perceptual thresholds on the order of 10 seconds of arc or less, i.e. well below the
resolution of the cone mosaic.

Fig. 5.5 Demonstration of the shift in perceived spatial frequency after adaptation. After
closing one eye and looking at the small cross at the left for at least one minute, then looking
at the cross on the right, the two striped patterns at the right appear to have different spatial
frequencies (adapted from Blakemore & Sutton 1969).

After-Effects and Work-Range Adjustment


Sampling a parameter space with overlapping tuning curves allows the system to
adjust its sensitivity to parameter contrasts according to the variability or statistics
of the input signal. On a relatively short time scale (minutes), this is evident from
so-called after-effects, which are a ubiquitous phenomenon in perception. Fig. 5.5
shows an example from
the perception of the spatial frequency, or granularity, of a grating. After monocu-
larly fixating the left fixation cross for a minute or so, the upper and lower parts of
the right display appear to differ in spatial frequency. The visual system seems to
adapt to the spatial frequency on the left side and after adaptation conveys the new
spatial frequencies in relation to the ones adapted to. Similarly, after adapting to
the downward movement in a waterfall, still objects appear to move upward; after
adapting to a red pattern, the same pattern shown only by a black outline on a white
background appears greenish, etc.
Neurophysiologically recorded tuning curves support the idea that many sensory
parameters, such as orientation and spatial frequency of visual textures, motion,
color, or depth, are represented in a population code.

Fig. 5.6 After-effects are a common phenomenon in the perception of motion, color, texture
granularity, etc. A standard explanation assumes that perception is based on the population
activity of channels with overlapping tuning curves. a. A stimulus qo is presented intermediate
between the preferred stimuli p̂1 and p̂2 of two channels. The resulting population activity
shows two equal excitations and the center of gravity estimator q∗ truly reproduces the original
stimulus qo. b. The system now adapts to a new stimulus, qad. Adaptation is modeled as a
reduction of the sensitivity of active channels. c. If the original stimulus is again presented to
the adapted system, the activity of the previously active channel will be reduced. The center-
of-gravity estimator is therefore displaced away from the adapting stimulus, i.e. to the left.

In this coding scheme, adapta-
tion is modeled as an overall decrease of sensitivity caused by strong and sustained
activity of a given channel in the adaptation phase. This decrease affects the chan-
nels most closely tuned to the adapting stimulus and may be due to fatigue, learning
processes or other mechanisms. In any case, the sensitivity of the channels tuned to
the adapting stimulus will be reduced in the test phase of the experiment. Thus, the
population code will be biased away from the adapting stimulus (Fig. 5.6). In on-
going perception, channel adaptation will lead to an improved detection of changes,
or of deviations from the average. The perception of constant stimuli will be sup-
pressed while more sensitivity will be assigned to parameter ranges with higher
variability.
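The mechanism of Fig. 5.6 can be reproduced in a few lines (our own sketch; the Gaussian tuning curves and the 50 % sensitivity loss are arbitrary illustrative choices):

```python
import numpy as np

p_hat = np.array([0.0, 1.0])                  # preferred stimuli of two channels

def rho(q):
    """Assumed Gaussian tuning curves of the two channels."""
    return np.exp(-0.5 * ((q - p_hat) / 0.8) ** 2)

def cog(e):
    """Center-of-gravity estimator, Equation 5.8."""
    return np.sum(p_hat * e) / np.sum(e)

q0, q_ad = 0.5, 1.0                           # test and adapting stimulus
gain = 1.0 - 0.5 * rho(q_ad)                  # adapted channels lose sensitivity

print(cog(rho(q0)))                           # before adaptation: exactly 0.5
print(cog(gain * rho(q0)))                    # after: shifted away from q_ad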

Vector-Valued Parameters and Interpolation


The theory of population coding is not restricted to stimuli varying in just one pa-
rameter. As an example, we consider the encoding of pointing movements of the
arm, which can be described in a two-dimensional parameter space of azimuth and
elevation from the shoulder (Georgopoulos et al. 1993). Cells in the motor cortex are
active prior to arm movements into certain portions of the grasp space. One can de-
fine the motor field of a cell as the probability of each movement given that the cell
was active. These motor fields are analogous to tuning curves in the sensory systems.
Different cells in the motor cortex have different, but overlapping motor fields. One
may now ask whether it is possible to predict the motor action based
on the activities of a large number of motor cells. To do this, the motor fields are
first determined and for each cell, the preferred motion vector is identified.

Fig. 5.7 Population coding for multi-dimensional parameters. In a two-dimensional parameter
space p1, p2, two units (channels) are marked with preferred stimuli p̂1,2. In this case, the
tuning curves become two-dimensional functions ρi(p1, p2) which peak at (p1, p2) = p̂i.
Patterns of population activity are shown for a number of intermediate stimuli
qλ := λ p̂1 + (1 − λ)p̂2, with λ = 0, 0.33, 0.67, 1.

Then, if a
pattern of activity is given on the motor cell population, one may calculate the center
of gravity estimator (Equation 5.8) where it is understood that the scalar “preferred
stimuli” p̂i of each neuron are to be replaced by vectorial “preferred movements” p̂i .
Note that Equation 5.8 works just as well with vectorial quantities, since the activ-
ities ei always remain scalars. Georgopoulos et al. (1993) use the term “population
vector” for the center-of-gravity estimator of vectorial quantities. It turns out that by
means of this population vector, the actual arm movement is nicely predicted. If the
population vector is monitored over time while the monkey is planning a movement,
it can be seen anticipating this movement (“mental rotation”).
Fig. 5.7 shows the general situation for a two-dimensional parameter space. The
tuning curves (or motor fields) are bell-shaped curves in that space, centered at
the cell’s preferred parameter value p̂i ∈ IR2 . The interpolation between two such
preferred parameter vectors is realized by decreasing activity in one channel and
increasing activity in the other. By means of the center-of-gravity calculation, this
amounts to an interpolation of the population estimate. If the two preferred stimuli
are different pointing directions, the described shift of activity from one channel to
another amounts to a rotation of the population vector.
Two- or higher-dimensional tuning curves are generally modeled as “radial basis functions” (RBF):

$$ \rho_i(p) := f(\|p - \hat{p}_i\|) \qquad (5.10) $$

where f is a monotonically decreasing function on IR₀⁺ such as the Gaussian (Poggio & Girosi 1990). In the figure, the tuning curves are symbolized by circles. If the
preferred stimuli (RBF-centers) are suitably distributed in parameter space, contin-
uous variation of the parameter values can be represented by the relative activities
in the population.
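The following sketch (our own, with Gaussian radial basis functions as a concrete choice for f in Equation 5.10) illustrates the population vector and its interpolation property for the situation of Fig. 5.7:

```python
import numpy as np

# preferred movement directions of eight channels on the unit circle
angles = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
p_hat = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # shape (8, 2)

def rho(p):
    """Radial basis functions, Equation 5.10, with a Gaussian profile."""
    return np.exp(-0.5 * np.linalg.norm(p - p_hat, axis=1) ** 2)

def population_vector(e):
    """Vector-valued center-of-gravity estimator, cf. Equation 5.8."""
    return (p_hat * e[:, None]).sum(axis=0) / e.sum()

# interpolation between two preferred movements, cf. Fig. 5.7
for lam in (0.0, 0.33, 0.67, 1.0):
    q = lam * p_hat[0] + (1.0 - lam) * p_hat[2]
    print(lam, population_vector(rho(q)))
```

The decoded vector rotates smoothly from p̂2 to p̂1 as λ increases; its direction tracks the encoded movement, while its length is somewhat shrunk toward the origin, a known bias of the raw population vector.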
5.1.5 Summary
Population coding is a ubiquitous scheme of neural coding with a number of
characteristic properties that also play a role in neural information processing.
Some of these properties are:
• Information content: Population codes can convey more information than non-
overlapping labeled lines. The most compelling example is color vision where
three channels can code for some two million colors, differentiated in hue, satu-
ration and brightness.
• Hyperacuity (sub-pixel resolution): Population codes inherently allow for a res-
olution better than the channel spacing. Hyperacuity in visual line localization is
a clear example.
• After-effects and the adjustment of working-ranges: Since individual parameter
values are coded by the balance of activity in overlapping channels, adaptation
of individual channels leads to distorted percepts. Overall, aftereffects tend to
emphasize contrasts, rather than static situations. If the average value of a sensory
parameter changes, the system can adjust its working-range.
• Interpolation and mental rotation: Reading population codes by the center of
gravity estimator implies an interpolation operation. Examples include the cod-
ing of arm movements in cortical motor neurons, eye-movements to multiple
targets, both in the fronto-parallel plane and in depth, and the hippocampal place-
field code for location. The different phenomena described as mental rotation (cor-
relation of reaction time and turning angle in same-different judgments, neural
representation of arm movements) can be interpreted as direct consequences of
the interpolation property of population codes.

5.2 Retinotopic Mapping


In the visual cortex, information from the different parts of the retina is system-
atically ordered, resulting in what is called a retinotopic map or retinotopic rep-
resentation. This map can be charted by recording from cortical neurons and then
determining which location on the retina delivers peak stimulation to each cortical
location. The cortex of primates has a large number of areas (> 20), each of which
contains a more or less retinotopic image of the retina.
Here we will briefly discuss models for retinotopic maps given in terms of map-
pings or coordinate transforms from the retina to the visual cortex. We denote such
mappings by curly capital letters:

R : IR2 → IR2 , R(x, y) := (u, v). (5.11)

The retinal and cortical coordinates are given by the two-dimensional vectors (x, y)
and (u, v), respectively.
Fig. 5.8 The area of a parallelogram. A parallelogram may be described by the vectors
a = (a1, a2)⊤ and b = (b1, b2)⊤. After dividing it into two triangles along the diagonal a + b,
the surface area may be calculated as the base times the height. The length of the base is
‖a + b‖. The normal vector in the direction of the height is (−a2 − b2, a1 + b1)⊤/‖a + b‖.
The height may be obtained by projecting a onto this normal. Then the surface area may be
calculated: A = |a1 b2 − a2 b1|.

5.2.1 Areal Magnification


An important issue in retinotopic mapping is areal magnification, i.e. the cortical
representational area devoted to some patch of visual field or retina. In this section,
we derive an expression for the magnification of a transformation between two-
dimensional areas.
We first note that the surface area of a parallelogram defined by the vectors
(a1, a2)⊤ and (b1, b2)⊤ (cf. Fig. 5.8) is

$$ A = |a_1 b_2 - b_1 a_2| = \left| \det \begin{pmatrix} a_1 & b_1 \\ a_2 & b_2 \end{pmatrix} \right|. \qquad (5.12) $$

The areal magnification of a linear mapping $\mathcal{L}$,

$$ \mathcal{L}\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} a_{11}\,x + a_{12}\,y \\ a_{21}\,x + a_{22}\,y \end{pmatrix}, $$

can be determined by examining the square defined by the vectors a = (1, 0)⊤ and
b = (0, 1)⊤ in the domain of the mapping. The area of this square is 1. Its image is
the parallelogram defined by the vectors L(a) = (a11, a21)⊤ and L(b) = (a12, a22)⊤.
According to the rule discussed above, its area is |a11 a22 − a12 a21 |. Since the area of
the original square was 1, the new area is also the areal magnification of the linear
mapping. For the linear mapping, this magnification is the same everywhere.
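As a worked example (our own numbers): for a₁₁ = 2, a₁₂ = 1, a₂₁ = 0, a₂₂ = 3, the unit square is mapped onto the parallelogram spanned by (2, 0)⊤ and (1, 3)⊤, with area |2 · 3 − 1 · 0| = 6; this linear mapping magnifies every area by the factor 6.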
For a general (differentiable) mapping M : IR2 → IR2 , areal magnification can
be determined by first finding a local linear approximation and then calculating the
magnification of this linear mapping (cf. Fig. 5.9). A linear approximation (gray
grid in Fig. 5.9) can be obtained from the matrix of the partial derivatives of the two
components of the mapping M , that is, from the Jacobian matrix of M (see, for
example Rudin 1976):
$$ J_{\mathcal{M}}(x, y) := \begin{pmatrix} \dfrac{\partial M_1}{\partial x} & \dfrac{\partial M_1}{\partial y} \\[8pt] \dfrac{\partial M_2}{\partial x} & \dfrac{\partial M_2}{\partial y} \end{pmatrix}. \qquad (5.13) $$
Fig. 5.9 The definition of areal magnification of a general transformation T . Left: the domain
in which T is defined. Right: The image of the grid as transformed by T (black) and the local
approximation of T in the vicinity of the center of the grid.

The linear approximation of a multidimensional mapping by means of the Jacobi


matrix is equivalent to the approximation of a one-dimensional function by a tangent
whose slope is given by the local derivative of the function. The determinant of the
Jacobi matrix is the local areal magnification produced by the mapping M .
$$ |\det J_{\mathcal{M}}(x, y)| = \left| \frac{\partial M_1}{\partial x}\,\frac{\partial M_2}{\partial y} - \frac{\partial M_1}{\partial y}\,\frac{\partial M_2}{\partial x} \right| \qquad (5.14) $$
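Where no analytic form of the map is available, Equation 5.14 can be evaluated by finite differences; a minimal sketch (our own, with an example mapping chosen purely for illustration):

```python
import numpy as np

def M(x, y):
    """Example mapping (chosen for illustration only)."""
    return np.array([x**2 - y, x + y**2])

def areal_magnification(x, y, h=1e-5):
    """|det J_M| via central finite differences, Equation 5.14."""
    dMdx = (M(x + h, y) - M(x - h, y)) / (2.0 * h)   # d/dx of both components
    dMdy = (M(x, y + h) - M(x, y - h)) / (2.0 * h)   # d/dy of both components
    return abs(dMdx[0] * dMdy[1] - dMdy[0] * dMdx[1])

print(areal_magnification(1.0, 2.0))   # analytic value: |4xy + 1| = 9
```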

The reason for having different areal magnifications in the cortical representation
of the visual field is that the density of retinal ganglion cells is not constant. We
can assume that, at least approximately, each retinal ganglion cell maps to the same
number of cortical neurons covering a constant area of the cortical surface. In the
primate retina, the distribution of ganglion cells has a peak around the fovea and
declines towards the periphery in a more or less isotropic way. In order to obtain
equal representation in the cortex, the mapping must therefore be distorted, with an
expansion (high areal magnification) of the central retina and a compression (low
areal magnification) of the periphery. If dg (x, y) denotes the density of ganglion cells
and R represents the imaging function describing the retinotopic map, equal cortical
representation of all ganglion cells amounts to

dg (x, y) = c|det JR (x, y)|, (5.15)

where c is a constant and | det JR | is the areal magnification factor.


Equation 5.15 is a partial differential equation related to the so-called eikonal
equation of mathematical physics. It does not have a unique solution. Indeed, mul-
tiple solutions of Equation 5.15 have been used to model multiple retinotopic maps
in cortical areas V1, V2, and V3 (Mallot 1985).
5.2.2 Conformal Maps


The analysis can be simplified for a class of mathematical mapping functions known
as conformal maps. These functions are characterized by the property that small
circles are mapped to circles of variable size but not to ellipses. A related prop-
erty is that lines intersecting at right angles in the domain of the function will be
mapped to curves again intersecting at a locally orthogonal angle. Cortical retino-
topic maps are not generally conformal; however, conformality may be used as a first
approximation.
Mathematically, conformal mapping satisfies a pair of partial differential equa-
tions known as the Cauchy-Riemann equations. If C1 (x, y) and C2 (x, y) denote the
components (cortical coordinates) of a mapping function C , the Cauchy-Riemann
equations read

$$ \frac{\partial C_1}{\partial x} = \frac{\partial C_2}{\partial y} \qquad\text{and}\qquad \frac{\partial C_1}{\partial y} = -\,\frac{\partial C_2}{\partial x}. \qquad (5.16) $$
It follows immediately that the Jacobian of a conformal function describes a pure
rotation combined with a scaling, but no shear, in accordance with the mapping
properties described above.

$$ J_{\mathcal{C}}(x, y) = \begin{pmatrix} A & B \\ -B & A \end{pmatrix} \qquad (5.17) $$

with
$$ A = \frac{\partial C_1}{\partial x} = \frac{\partial C_2}{\partial y} \qquad\text{and}\qquad B = \frac{\partial C_1}{\partial y} = -\,\frac{\partial C_2}{\partial x}. \qquad (5.18) $$
For cortical magnification we obtain

$$ |\det J_{\mathcal{C}}| = A^2 + B^2. \qquad (5.19) $$
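Equations 5.16 to 5.19 can be checked numerically for a simple conformal map, the complex square z ↦ z² with components C1 = x² − y², C2 = 2xy (our own sketch):

```python
import numpy as np

def C(x, y):
    """Components of the conformal map z -> z^2, z = x + iy."""
    return np.array([x**2 - y**2, 2.0 * x * y])

def jacobian(x, y, h=1e-6):
    """Finite-difference Jacobian, cf. Equation 5.13."""
    dx = (C(x + h, y) - C(x - h, y)) / (2.0 * h)
    dy = (C(x, y + h) - C(x, y - h)) / (2.0 * h)
    return np.column_stack([dx, dy])

J = jacobian(2.0, 1.0)
print(J)                                  # rotation-plus-scaling form of Eq. 5.17
A, B = J[0, 0], J[0, 1]
print(np.linalg.det(J), A**2 + B**2)      # areal magnification, Eq. 5.19
```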

Conformal maps can also be thought of as one-dimensional complex differentiable


functions, where the real and imaginary parts of domain and range are the two-
dimensional coordinates of retina and cortex, respectively. In this case, areal mag-
nification is simply the squared absolute value (modulus) of the complex derivative.
Since formally, complex derivatives behave just as real derivatives, the mapping-
magnification equation can now be turned into an ordinary differential equation.
As a result of conformality, the areal magnification will also be equal to the
square of the length magnification of a line segment (the “linear” magnification). In
monkeys, areal magnification has been shown to approximately equal 1/r², where
r = √(x² + y²) is retinal eccentricity. Eq. 5.15 then becomes:

$$ \frac{1}{r^2} = c\,\big(R'(r)\big)^2 \qquad (5.20) $$
for some constant c. Taking square roots on both sides and integrating yields, after
renaming the constant (see Fischer 1973):

$$ R(r) = \frac{1}{c} \log r. \qquad (5.21) $$
This function will be discussed in the following section.

5.2.3 Log-Polar Mapping


Following the conformality assumption, we can replace r in Equation 5.21 by the
complex number x + iy = re^{iφ}, where r = √(x² + y²) and φ = arg(x + iy).
With Euler's formula (Eq. 3.19) we then obtain:

$$ \mathcal{P}(x + iy) = u + iv = \log\!\big( |x + iy| \, e^{\,i\,\arg(x + iy)} \big) = \log r + i\,\varphi. \qquad (5.22) $$

Fig. 5.10 Log-polar mapping of the visual hemifield with offset. The inset on the left shows
the contralateral (right) visual hemifield with f = (0, 0) marking the center. The big figure
shows the image of the polar grid under the log-polar map (u, v) = Pc(x, y), where f′ =
Pc(f) is the foveal representation. The representation of the vertical meridian, delimiting
the visual hemifield to the left, is the ⊂-shaped curve. The semicircular perimeter maps to the
straight line to the right. Close to the fovea, the radial structure of the retinal grid is preserved.
Further to the periphery, the visual field radii become parallel lines and the iso-eccentricity
circles approximate straight verticals. This effect is increasingly pronounced if the value of c
is reduced. The straight line in the map, running from top left to bottom right is the image of
the logarithmic spiral marked in the visual field.
The cortical u-coordinate thus corresponds to the logarithmically compressed retinal


eccentricity, while the cortical v-coordinate represents the angle from the horizontal
meridian.
Equation 5.22 is known as the log-polar mapping. In the fovea itself, i.e. at r = 0,
it is undefined since log(r) approaches negative infinity. Biologically, this reflects
our assumption dg(r) = r⁻², which unrealistically implies that ganglion cell density
and therefore areal magnification in the retina become infinite. This can be mended by
adding a small constant c in the denominator, i.e. dg(x, y) = 1/((x + c)² + y²), which
will prevent dg from growing beyond 1/c². We replace the complex numbers by two-
dimensional vectors (x, y) and (u, v), respectively, and obtain

$$ \mathcal{P}_c(x, y) = \begin{pmatrix} \tfrac{1}{2} \log\!\big((x + c)^2 + y^2\big) \\[4pt] \arctan\dfrac{y}{x + c} \end{pmatrix}. \qquad (5.23) $$

The retinotopic map Pc is shown in Fig. 5.10 by means of a polar grid in the right
visual field and the image of this grid in the left visual cortex, area V1. The plot
nicely models neurophysiological measurements of visual retinotopic maps in mon-
keys. Also shown is a straight line in the cortical representation together with a curve
in the visual field that will be mapped to this straight line under the log-polar map-
ping. Easy calculation proves that this curve is a logarithmic spiral. This is thought
to be the reason for the spiral structure of visual hallucination patterns known for
example from migraine “aura” events. In migraine aura events, a wave front known
as spreading depression is moving across the visual cortex, inducing a temporary
scotoma in the visual field. While the depression wave front is approximately a
straight line, the subjects perceive a spirally shaped scotoma, since they (as always)
interpret the visual field position of cortical activity via the inverse of the retinotopic
map (see Bressloff et al. 2001).
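Equation 5.23 and the spiral property are easy to verify numerically (our own sketch; the offset c and the spiral pitch a are arbitrary choices). Points on a logarithmic spiral r = exp(aφ) map onto an approximately straight line u ≈ a·v, exactly so for c = 0:

```python
import numpy as np

def P(x, y, c=0.05):
    """Log-polar map with foveal offset c, Equation 5.23."""
    u = 0.5 * np.log((x + c)**2 + y**2)
    v = np.arctan2(y, x + c)               # equals arctan(y/(x+c)) for x >= 0
    return u, v

a = 0.8                                    # pitch of the logarithmic spiral
phi = np.linspace(0.2, 1.4, 7)
r = np.exp(a * phi)
u, v = P(r * np.cos(phi), r * np.sin(phi))

print(np.round(u - a * v, 3))              # residuals close to zero: the image
                                           # is (nearly) the straight line u = a*v
```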

5.3 Suggested Reading


Books
Cover, T. M. and Thomas, J. A. (1991). Elements of Information Theory. Wiley
Series in Telecommunications. John Wiley & Sons, Inc., New York.
Kriegeskorte, N. and Kreiman, G. (eds.) (2012) Visual Population Codes. Toward a
Common Multivariate Framework for Cell Recording and Functional Imaging.
The MIT Press, Cambridge, MA.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., Bialek, W. (1997) Spikes.
Exploring the Neural Code. The MIT Press, Cambridge, MA.

Original Papers
Bressloff, P. C., Cowan, J. D., Golubitsky, M., Thomas, P. J., and Wiener,
M. C. (2001) Geometric visual hallucinations, Euclidean symmetry and the
functional architecture of striate cortex. Philosophical Transactions of the Royal


Society London (B) 356:299 – 330
Hallucination patterns are explained as simple waves of activity on the cortex,
interpreted by the observer by “backward projection” with the retinotopic map-
ping function. The idea is extended to the pattern of cortical hypercolumns
which is hypothesized to give rise to patterns such as the zig-zag in migraine
“fortification” hallucinations.
Fischer, B. (1973). Overlap of receptive field centers and representation of the visual
field in the cat’s optic tract. Vision Research, 13:2113 – 2120.
Derivation of the logarithmic structure of retinotopic maps from the density
distribution of the retinal ganglion cells.
Georgopoulos, A. P., Taira, M., and Lukashin, A. (1993). Cognitive neurophysiol-
ogy of the motor cortex. Science, 260:47 – 52.
Demonstration of population coding in the motor cortex. This paper popularized
the ideas of population coding and the center-of-gravity estimator, for which the
authors introduced the term “population vector”.
Kay, K. N., Naselaris, T., Prenger, R. J., and Gallant, J. L. (2008). Identifying natural
images from human brain activity. Nature 452:352 – 355.
This paper defines a receptive field function for cortical voxels measured with
functional brain imaging. Using the methods of population coding, it then shows
to what extent visually presented images can be recovered from simultaneously
recorded brain activity.
Poggio, T., Fahle, M., and Edelman, S. (1992). Fast perceptual learning in visual
hyperacuity. Science, 256:1018 – 1021.
This paper presents a model of visual hyperacuity along the lines of popula-
tion coding and radial basis function networks. It then proceeds to show that
perceptual learning in human subjects can be simulated with the same model.
Polimeni, J. R., Balasubramanian, M., and Schwartz, E. L. (2006). Multi-area vi-
suotopic map complexes in macaque striate and extra-striate cortex. Vision Re-
search 46: 3336 – 3359
The idea of log-polar mapping the primary visual cortex area V1 is extended
to adjacent areas which can be modeled by similar functions, i.e. the negative
branch of the complex logarithm.
References

Adelson, E.H., Bergen, J.R.: Spatiotemporal energy models for the perception of motion.
Journal of the Optical Society of America A 2, 284–299 (1985)
Antolik, J., Bednar, J.A.: Development of maps of simple and complex cells in the primary
visual cortex. Frontiers in Computational Neuroscience 5(17) (2011)
Aidley, D.J.: The Physiology of Excitable Cells, 4th edn. Cambridge University Press, Cam-
bridge (1998)
Bressloff, P.C., Cowan, J.D., Golubitsky, M., Thomas, P.J., Wiener, M.C.: Geometric visual
hallucinations, Euclidean symmetry and the functional architecture of striate cortex. Philo-
sophical Transactions of the Royal Society (London) B 356, 299–330 (2001)
Barlow, H.B., Levick, W.R.: The mechanism of directional selectivity in the rabbit’s retina.
Journal of Physiology 173, 477–504 (1965)
Blakemore, C., Sutton, P.: Size adaptation: a new aftereffect. Science 166, 245–247 (1969)
Cover, T.M., Thomas, J.A.: Elements of Information Theory. John Wiley & Sons, New York
(1991)
Eichner, H., Joesch, M., Schnell, B., Reiff, D.F., Borst, A.: Internal structure of the fly ele-
mentary motion detector. Neuron 70, 1155–1164 (2011)
Fischer, B.: Overlap of receptive field centers and representation of the visual field in the cat’s
optic tract. Vision Research 13, 2113–2120 (1973)
FitzHugh, R.: Impulses and physiological states in theoretical models of nerve membrane.
Biophysical Journal 1, 445–466 (1961)
Georgopoulos, A.P., Taira, M., Lukashin, A.: Cognitive neurophysiology of the motor cortex.
Science 260, 47–52 (1993)
Haykin, S.: Neural Networks and Learning Machines, 3rd edn. Pearson Prentice Hall, New
York (2008)
Hines, M.L., Carnevale, N.T.: The NEURON simulation environment. Neural Computation 9,
1179–1209 (1997)
Hodgkin, A.L., Huxley, A.F.: A quantitative description of membrane current and its appli-
cation to conduction and excitation in nerve. Journal of Physiology 117, 500–544 (1952)
Hopfield, J.J.: Neural networks and physical systems with emergent collective computational
abilities. Proceedings of the National Academy of Sciences 79, 2554–2558 (1982)
Hartline, H.K., Ratliff, F.: Spatial summation of inhibitory influences in the eye of limulus,
and mutual interaction of receptor units. Journal of General Physiology 41, 1049–1066
(1958)
Itti, L., Koch, C.: A saliency-based search mechanism for overt and covert shifts of visual
attention. Vision Research 40, 1489–1506 (2000)
Jack, J.J.B., Noble, D., Tsien, R.W.: Electric current flow in excitable cells. Clarendon Press,
Oxford (1975)
Jones, J.P., Palmer, L.A.: An evaluation of the two-dimensional Gabor filter model of simple
receptive-fields in the cat striate cortex. Journal of Neurophysiology 58, 1233–1258 (1987)
Kay, K.N., Naselaris, T., Prenger, R.J., Gallant, J.L.: Identifying natural images from human
brain activity. Nature 452, 352–355 (2008)
Koenderink, J.J.: Scale-time. Biological Cybernetics 58, 159–162 (1988)
Kohonen, T., Reuhkala, E., Mäkisara, K., Vainio, L.: Associative recall of images. Biological
Cybernetics 22, 159–168 (1976)
Lettvin, J.Y., Maturana, H.R., McCulloch, W.S., Pitts, W.H.: What the frog’s eye tells the
frog’s brain. Proceedings of the Institute of Radio Engineers 47, 1950–1961 (1959)
von der Malsburg, C.: Self–organization of orientation sensitive cells in the striate cortex.
Kybernetik 14, 85–100 (1973)
Mallot, H.A.: An overall description of retinotopic mapping in the cat’s visual cortex areas
17, 18, and 19. Biological Cybernetics (1985)
Minsky, M.L., Papert, S.A.: Perceptrons, expanded edition. The MIT Press, Cambridge
(1988)
Murray, J.D.: Mathematical Biology. I. An introduction, 3rd edn. Springer, Berlin (2002)
Mallot, H.A., von Seelen, W., Giannakopoulos, F.: Neural mapping and space–variant image
processing. Neural Networks 3, 245–263 (1990)
Milescu, L.S., Yamanishi, T., Ptak, K., Smith, J.C., Mogri, M.Z.: Real time kinetic modeling
of voltage-gated ion channels using dynamic clamp. Biophysical Journal 95, 66–87 (2008)
Oja, E.: A simplified neuron model as a principal component analyzer. Journal of Mathemat-
ical Biology 15, 267–273 (1982)
Poggio, T., Girosi, F.: Regularization algorithms for learning that are equivalent to multilayer
networks. Science 247, 978–982 (1990)
Pollen, D.A., Ronner, S.F.: Visual cortical neurons as localized spatial frequency filters. IEEE
Transactions on Systems, Man, and Cybernetics 13, 907–916 (1983)
Rall, W.: Electrophysiology of a dendritic neuron model. Biophysical Journal 2, 145–167
(1962)
Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back–propagating
errors. Nature 323, 533–536 (1986)
Rudin, W.: Principles of mathematical analysis, 3rd edn. McGraw-Hill, New York (1976)
Schölkopf, B., Smola, A.: Learning with Kernels. Support Vector Machines, Regularization,
Optimization and Beyond. The MIT Press, Cambridge (2002)
Tuckwell, H.: Introduction to theoretical neurobiology (2 Vols.). Cambridge University Press,
Cambridge (1988)
Index

absolute value, 66
action potential, 3, 34
  all-or-nothing behavior, 13
activation dynamics, 83
activation function, 85
adaptation, 121
after-effect, 120
amplitude, 64
argument (of a complex number), 66
associative memory, 100
auto-correlation, 51, 78
back-propagation, 94, 98
bit, 114
box-function, 71
bug detector, 28, 91
cable equation, 17
capacitance, 5
causality, 33, 39
center-surround organization, 36
center of gravity estimator, 117
central limit theorem, 40
channel code, 114
classification, 89–100
coefficients (of Fourier series), 71, 75
coincidence detector, 50
complex cell, 46, 48, 53
complex exponential, 67
complex number, 66, 126
conditioning, 88
conformal map, 126
content addressable memory, 105
contour edge, 45
contrast, 61
convolution, 30, 33, 44, 63
  commutativity of, 30, 80
convolution theorem, 72, 79
correlation, 26
correlation detector, 51
correlation theorem, 80
covariance, 92
covariance matrix, 103, 104, 107
cross-correlation, 51
DC component, 71, 75
decision boundary, 90, 92
delay, 51
δ-Impulse, 27
δ-rule, 97
depolarization, 5
differential equation, 7
  analytical solution, 7
  numerical solution, 12
  partial, 17, 125, 126
dot product, 86, 92
eigenfunction, 63
eigenvalue, 63, 109
eigenvector, 109
energy model, 47
entropy, 115
equilibrium potential, 3
error minimization, 94, 105
Euler’s formula, 67, 127
even function, 38
excitability, 3
feature space, 89
FitzHugh-Nagumo equations, 13
fixed point, 14
Fourier analysis, 57–81
Fourier series, 71
Fourier transform, 77
frequency, 71
frequency plane, 79
function approximation, 117
functional, 41
Gabor function, 37, 46
Gaussian function, 35
  difference of, 36
  Fourier transform of, 68
gradient descent, 95
grating, 61
half-wave rectification, 42
harmonic, 60
Heaviside function, 44
Hebb rule, 88, 102, 106
  normalizing, 108, 110
hidden unit, 92, 98
high-pass, 72, 75
Hodgkin-Huxley theory, 4–13, 19
holographic memory, 106
Hopf bifurcation, 15
hyperacuity, 119
hyperplane, 100
information theory, 114
inner product, 79, 86
intensity code, 114
ion distribution, 2
Jacobian matrix, 124
kernel, 27, 100
Kirchhoff’s rules, 16
labeled line code, 50, 114
Laplacian, 31
lateral inhibition, 28
learning
  competitive, 106
  supervised, 94
  unsupervised, 88
learning rule, 88, 94
length constant, 18
limit cycle, 15
line spectrum, 77
linearity, 64
  of a receptive field, 24
  of classification, 90
local field potential, 114
log-polar mapping, 128
logical neuron, 84
low-pass, 69, 72, 75
Mach band, 31
magnetic resonance imaging, 61
magnification, 32, 124
matched filter, 28, 91
matrix, 87, 101
maximum likelihood estimation, 118
migraine aura, 128
modulation transfer function, 68, 74
Moore-Penrose pseudo inverse, 105
motion detection, 49, 53
motor field, 121
musical chord, 59
node of Ranvier, 18
non-linearity, 8, 40
  interaction, 45
  static, 42, 85
norm of a vector, 86
odd function, 38
Ohm’s law, 16
Oja learning rule, 107
operator, 27, 63
optimal stimulus, 28, 91
  for motion, 53
orientation, 39
orthogonality, 86
  of sinusoidals, 75
outer product, 103
overlearning, 98, 99
parity problem, 93
partial derivative, 95, 124
passive conduction, 15, 18
pattern recognition, 89
perceptron, 89
  multi-layer, 92, 94, 98
peri-stimulus time histogram, 34
periodic function, 71
phase, 64, 66
phase portrait, 13
phasic behavior, 9
phasor, 67
plane wave, 38, 61, 78
point image, 29
point spread function, 29, 75
population code, 114
  interpolation, 122
population vector, 122
potassium channel, 7
power spectrum, 78
predicate, 89
principal component analysis, 109
projection, 87
quadrature pair, 47
radial basis function, 122
rate constant, 9
receptive field, 23–28
receptive field function, 26
  isotropic, 35
  oriented, 37
refractory period, 15
relaxation, 6
response profile, 26
resting potential, 2, 3
retinal ganglion cell, 23, 35, 125
retinotopic, 123
reverse correlation, 23
saturation, 42, 44
scale, 35, 39
self-organizing feature map, 109
side lobes (of a Gabor function), 39
sigmoidal, 44
simple cell, 47
small signal behavior, 42
sodium channel, 9
space-clamp, 11
spatial frequency, 38, 39, 61
spatio-temporal, 33
  Gabor function, 39
  receptive field, 53
  separability, 34, 37
spectrogram, 60
spectrum, 58
spike rate, 34
squaring non-linearity, 44
stationarity, 33
steady state, 14
step edge, 45
superposition, 24, 40, 41
support vector machine, 99
synaptic weight, 85
threshold, 42
time constant, 6, 36
transfer function, 85
translation invariance, 29, 31, 64
  and non-linearity, 42
transposition, 84
tuning curve, 35, 50, 115, 118
uncertainty relation, 69
vector, 84
visual cortex, 37, 53, 123
voltage clamp, 6, 7
Volterra-kernel, 45
wave equation, 19
wave number, 71
wavelength, 71
wavelet, 38
weight dynamics, 83, 88
winner-unit, 110
XOR-problem, 92, 100
