
INDEX

S.NO TOPICS PAGE.NO


Week 1
1 Introduction to the Machine Learning Course 4

2 Foundation of Artificial Intelligence and Machine Learning 12

3 Intelligent Autonomous Systems and Artificial Intelligence 16

4 Applications of Machine Learning 20

5 Tutorial for week01 25

Week 2
6 Characterization of Learning Problems 27

7 Objects, Categories and Features 33

8 Feature related issues 39

9 Scenarios for Concept Learning 43

10 Tutorial for week02 49

Week 3
11 Forms of Representation 51

12 Decision Trees 57

13 Bayesian Belief Networks 60

14 Artificial Neural Networks 65

15 Tutorial for week03 71

16 Genetic algorithm 73

17 Logic Programming 78

Week 4
18 Inductive Learning based on Symbolic Representations and Weak Theories 82

19 Generalization as Search - Part 01 86

20 Generalization as Search - Part 02 92

21 Decision Tree Learning Algorithms - Part 01 97

22 Decision Tree Learning Algorithms - Part 02 103

23 Instance Based Learning - Part 01 109

24 Instance Based Learning - Part 02 117

25 Cluster Analysis 124

26 Tutorial for week04 131

Week 5
27 Machine Learning enabled by Prior Theories 133

28 Explanation Based Learning 140

29 Inductive Logic Programming 147

30 Reinforcement Learning - Part 01 Introduction 153

31 Reinforcement Learning - Part 02 Learning Algorithms 160

32 Reinforcement Learning - Part 03 Q - Learning 166

33 Case - Based Reasoning 168

34 Tutorial for week05 172

Week 6
35 Fundamentals of Artificial Neural Networks - Part1 174

36 Fundamentals of Artificial Neural Networks - Part2 180

37 Perceptrons 187

38 Model of Neuron in an ANN 191

39 Learning in a Feed Forward Multiple Layer ANN - Backpropagation 198

40 Recurrent Neural Networks 202

41 Hebbian Learning and Associative Memory 207

42 Hopfield Networks and Boltzmann Machines - Part 1 211

43 Hopfield Networks and Boltzmann Machines - Part 2 219

44 Convolutional Neural Networks - Part 1 227

45 Convolutional Neural Networks - Part 2 233

46 Deep Learning 236

47 Tutorial for week06 248

Week 7
48 Tools and Resources 259

49 Interdisciplinary Inspiration 259

Week 8
50 Preparation for Exam and Example of Applications 259

Machine Learning (ML)
Prof. Carl Gustaf Jansson
Prof. Henrik Boström
Prof. Fredrik Kilander
Department of Computer Science and Engineering
KTH Royal Institute of Technology, Sweden

Lecture 1
Introduction to the Machine Learning Course
Dear Students,
Welcome to this NPTEL course on machine learning. I'm Carl Gustaf Jansson, professor at KTH, and I'm responsible for developing and running this course. This is the first lecture of the first week of the course, and the purpose of this lecture is to relate machine learning to some other important, relevant areas, but also to give you an overview of the content of the course. The purpose of this course is to give a broad introduction to machine learning as a subfield of artificial intelligence and computer science. The course is not aimed to be a professional course in the sense of giving hands-on knowledge of specific algorithms, coding in specific languages and applying them to specific problems; there are many alternative courses of that kind. Machine learning has imported much relevant knowledge from statistics and probability theory, but this is not a general course on those fields either. This course will hopefully make you better computer scientists, able to handle data analysis problems, but it will not make you a general statistician or data scientist. Even if the focus of this course is machine learning in the context of artificial intelligence, I would like to say a few words about some neighboring areas: statistics, data science and big data. First of all, on the slide you see now you get a little overview of the situation, and on the coming slides I will say a few words about the different areas.
Essentially, two things have happened: work in machine learning has achieved a much wider spread, and a new area has emerged which is often called big data, based on the rapid growth of data in the world and the corresponding great need to get all this data analysed. I would say that these two things together triggered an urge to widen the concept of statistics somewhat, into the umbrella concept and term data science. So let's first turn to the area of statistics, the oldest of these areas, which has a very long history. Statistics has traditionally dealt with a range of activities concerning data, from data collection and data harvesting, through data modelling, up to data presentation for decision-makers, and in the middle there is data analysis. The big overlap between machine learning and statistics is in this middle, data analysis part. In particular, what is very useful and needed in machine learning is input from the traditional area of statistics with respect to mathematical statistics, which is the core of data analysis and which in turn is based on probability theory; a lot of input has come this way into machine learning. One can also distinguish different kinds of statistics: descriptive statistics, which is only concerned with describing, in a good way, the statistical samples taken, without any ambition to draw wider conclusions from them, and inferential statistics, where the ambition is to draw wider conclusions about whole populations. Machine learning is of course much more dependent on inferential statistics. If we turn now to data science, this area doesn't present much that is new; it was launched as an umbrella concept for statistics, augmented with other data analysis techniques developed within the big data field and with advances from machine learning. As for statistics, data science also covers the range of activities from harvesting up to visualization and presentation, with the same overlap with machine learning in the data analysis stage. The concept was coined around 1996-98, which is like 20 years ago, but then it was really boosted in this last decade and got a lot of publicity in various places. An example of that is the very special article in the Harvard Business Review called "Data Scientist: The Sexiest Job of the 21st Century", which contributed a lot to this new area's fame, and of course many young people looked for jobs, and for the relevant education, in this new area. In contrast to data science, the term big data primarily refers to the storage and maintenance of, and access to, data, even if the borderline between the two areas is rather blurred. The big data area is based on more traditional areas: very large databases, data warehousing and distributed databases. It was born in the late 1990s, right after the fast growth of data due to the explosive development of harvested sensor data, as well as data from the web and the parallel growth of traditional databases in a variety of sectors. When we speak about big data we talk about at least terabytes (10^12 bytes), and already we talk about zettabytes (10^21 bytes), but size is not the only relevant parameter: the variety of data types, the quality, and the velocity of generation are also important factors that increase the complexity. This was an excursion into some neighboring areas; now we turn back to machine learning and an overview of this course.
As you have understood, we have eight weeks, and the first three weeks have an introductory character. The second week gives a more abstract characterization of learning problems, and the third week focuses on different kinds of representations. The motivation for having this part of the course is that all learning has to take place within some kind of representation, and therefore it's crucial to have some knowledge about those kinds of representations that are most common in artificial intelligence. If you already have good knowledge of that, of course you can go very fast through that part of the course. The three middle weeks of the course, four to six, are the core parts of the course. Two weeks focus on learning in symbolic representations: the first only on inductive learning in situations where you have no theory, or a rather weak theory, that backs up the learning process, while in the second of these weeks we will look at situations where you have a pretty strong theory of the domain you're working in, so that you learn on the border of what is already known. Week six treats, as a contrast, learning in sub-symbolic representations, in particular artificial neural networks; of course also here there is a distinction between situations where you have some prior theory or not, but it did not seem relevant to divide that week up in the same fashion as weeks four and five. Finally, weeks seven and eight go in two directions. Week seven looks into the context of cognitive science: even if in theory a machine learning algorithm could be a pure computer science algorithm, normally the way one designs these kinds of techniques is based upon intuitions, knowledge and inspiration from the way natural systems work, in contrast to artificial ones. So in that week we will look into some models of learning from the various cognitive sciences that have very much inspired the way we do machine learning. Week eight gives an overview of the current, very vast set of tools and resources that are available at this point for doing machine learning work. It is not an easy area to survey, because the current very strong industrial interest promotes the development of tools and resources very much, so the hope is to bring some order and some overview of what happens in this area. This slide is aimed at giving you a clearer picture of the order in which one should follow this course. The bottom line is that it is always possible, meaningful and maybe advisable to go week by week, from week one to week eight, but as I understand it there are not always such strong preconditions everywhere. So I just want to say a few things here. Weeks one, two and three are strongly advisable to follow in the beginning, because they give an introduction to the area and the abstractions and general knowledge that are supposed to be a background for the rest. Weeks four, five and six could be taken in different orders, but as this course is set up it is more favorable to follow them in the given order, because many things that turn up in week six, in the sub-symbolic representations, in a way mirror what happens in the symbolic case in weeks four and five, where, as the course is now set up, they are treated in more detail and more thoroughly. What is more independent is weeks seven and eight, because they are extensions of the course in a more theoretical, background-oriented direction or in a more practical direction, so you can feel free to look into those whenever you like during your work with the course.
So now I'm going to turn to the individual weeks of the course. My intention is to briefly comment on the content of each week, to give you a flavor and a feeling for what you can anticipate during the course. I will not go into depth; I will make just a few observations and comments on each part. This week we will focus, in the coming two lectures, mostly on artificial intelligence. We'll talk both about the roots of the area sixty years ago, what happened then and what the profile of the area looked like at that point, and then I will give a more general overview of the area as such and place machine learning within the whole area. Finally there will be a lecture about applications of machine learning; theoretically that lecture could be placed anywhere within the course, but I've chosen to put it very early, because whatever technique we are going to talk about, we will always refer to some applications, so it's good to have some reference points of this kind early in the course. Finally, I want to comment on the general structure of each week: every week will have an introductory lecture, which sometimes could be a little shorter or sometimes a little longer, and there will always be a last lecture which we call a tutorial, and tutorial here means preparation for and introduction of the assignments for that week.
In the second week of the course the goal is to characterize the various situations which machine learning tries to tackle. To be very general, a majority of machine learning work tries to tackle situations where you have a large set of examples, and you study those examples, and from those examples you want to be able to classify new examples or build some abstractions. So one of the lectures tries to sort out what is an example, what is an abstraction, how can you characterize an example, and so on. Then of course there are different modes of working here. One way of working is what we call supervised, where all examples are pre-classified: for example, we feed the system with a number of images of cats, all the examples are cats, and all the system has to do is build up an abstraction of a cat. The contrast to that is unsupervised, which means that we feed the system with any image, say of any animal or any category, and then it's up to the system to sort out what is a cat, what is a dog, what is something else, which is a more difficult task. One can also play with the way you use the examples: if you want to supervise better, or more, you can say these are positive examples, these are negative examples, these are extreme examples, these are typical examples, which also facilitates or simplifies the learning process. Another distinction that is important is whether all examples exist from the beginning, or whether the examples come one at a time in a continuous fashion; the latter is also more difficult to handle. Another aspect is whether you learn, let's say, in the blue, where the only things you see are your examples and you build a small theory of abstractions from that, or, at the other extreme, you already have a strong theory, a really strong knowledge base, which you can already use for problem solving, and what you want to do is to improve that knowledge base, which means that you learn on the border of prior knowledge. An example of that is what we call reinforcement learning, which is a special case used a lot, for example, for training robots: you want to teach a robot to move its arm in an optimal way, and of course you have to start with some basic movement schemes, so there is a theory for moving, but you can improve the movements by adapting a number of parameters. Of course this means that you have to give the system feedback all the time on good movements, bad movements and so on, and this is a specific kind of, I would say, strong theory learning. Finally, the most ambitious thing here is to learn representations. By that I mean that normally in machine learning you have a fixed representation, a fixed model of the domain you're working with, and you always learn within that model. However, more and more people understand that this is not very realistic, because systems have to be robust and you cannot think about everything from the start. Therefore there is also a need for systems to learn the representations themselves: learn which kinds of objects, which kinds of features and so on one can see there, which is another step ahead in the advancement of the field.
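To make the supervised versus unsupervised distinction concrete, here is a minimal illustrative sketch, not part of the course material, written in Python; it assumes the scikit-learn library is installed, and the choice of dataset and models is an arbitrary example, not something the lecture prescribes.

```python
# Contrast between supervised and unsupervised learning on the same data.
# Assumes scikit-learn is installed; dataset and model choices are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)  # examples (feature vectors) and class labels

# Supervised: the learner is fed pre-classified examples (features + labels).
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
classifier = DecisionTreeClassifier().fit(X_train, y_train)
print("supervised accuracy:", classifier.score(X_test, y_test))

# Unsupervised: the learner sees only the examples and must group them itself.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", [int((clusters == k).sum()) for k in range(3)])
```

In the supervised case the abstraction (here a decision tree) is built from labelled examples; in the unsupervised case the system itself has to sort the unlabelled examples into categories.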
So the third week will be about various representations used for machine learning, and one can say there are two parts here. First we can look at the typical symbolic representations that have traditionally been used as the basis for building learning algorithms, and two such representations are decision trees and Bayesian networks; one can also add, and it will also be referred to, that we can work simply with arrays of different dimensionality, which is a third symbolic way of handling this. The other group is the sub-symbolic representations, of which the typical type is the representation in an artificial neural network, and that we will spend some time on. We will also mention another kind of sub-symbolic representation, genetic algorithms; we will not spend much time on that in this course, but we will mention it now and then because it complements the picture. Finally, I would like to comment on a slight complication we have here. In artificial intelligence we want to use machine learning to improve our programs, our problem-solving programs, and actually most systems built historically in artificial intelligence are based on multi-paradigm programming, a combination of logic programming, functional programming and object-oriented programming, so logically it would be so that when we learn, we want to modify those programs. But, slightly contradictorily, most work in machine learning has been focused on the earlier forms of representation, not so much on learning functional, logic and object-oriented programs as such. Of course it's important to know about this contradiction, and it's also important to know the properties of these representations, because in the end the results from learning processes have to be fed into the existing systems.
Now we turn to the core part of the course: three weeks where we will look at different kinds of algorithms that have been important in the development of machine learning. In the first of these weeks, week four, we will look at inductive learning algorithms that work on symbolic representations, primarily in cases where we have weak theories, in the sense that there is almost no, or very little, prior knowledge that guides the learning process. The first case here is supervised learning; this is the classical case where we learn from pre-classified examples and build up a well-defined set of abstractions. The opposite case is unsupervised learning, where we have non-classified examples and the algorithms themselves have to sort and categorize these examples and build the appropriate abstractions. The intermediate case we will look at is called instance-based learning. In the first two cases the algorithms build concrete, explicit abstractions from the examples and store these abstractions, while in the last case we have a system philosophy where we store all examples and do not build any abstractions; that means, of course, that we have to have a structure over the examples, and when new examples come in we have to relate them to the old ones and place them in the structure.
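As a hedged illustration of the instance-based idea just described - store the examples, build no explicit abstraction, and classify new examples by relating them to the stored ones - here is a minimal nearest-neighbour sketch in plain Python; the toy data and the function name are invented for the example.

```python
# Minimal k-nearest-neighbour classification: no abstraction is built;
# all training examples are stored, and a new example is related to them.
from math import dist  # Euclidean distance, available from Python 3.8

def nearest_neighbour_predict(examples, query, k=1):
    """examples: list of (feature_vector, label) pairs; query: feature_vector."""
    # Sort stored examples by distance to the query and vote among the k nearest.
    nearest = sorted(examples, key=lambda ex: dist(ex[0], query))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

# Invented toy data: two numeric features per example.
stored = [((1.0, 1.2), "cat"), ((0.9, 1.0), "cat"), ((3.0, 3.1), "dog")]
print(nearest_neighbour_predict(stored, (1.1, 1.1)))  # -> "cat"
```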
representation but look at a few cases where we combine the inductive learning with the
existence of strong theories and the first case I want to mention is inductive logic programming. I
already mentioned when we talked about representation that it was a kind of contradiction that
many machine learning algorithm worked in representations that were not the most important
and common ones for really building systems in artificial intelligence. Logic programming is one
of the very strong paradigms that has been used to build many systems in artificial intelligence
and 20 years ago the idea come up that why not we engineer our inductive algorithms so that
they can directly learn logic programming elements, and the first step is of course only to mimic
what one did earlier in the other form of representation but of course the next step of is that
because we now know now do everything in the same formalism we can more easily combine
prior knowledge or prior theories expressed in logic programming with the new kind of
abstractions that we built up from the inductions of examples and so this is essentially the key
idea logic programming. Another aspect is of course that for learning in cases with no prior
theory we need often many examples because many examples may be bad, may have noise may
be typical or not typical and so on, so this means that in order to play at there's noise and
different of my examples we need many examples, however we could also manage to learn from
fewer examples but then in a way we have to make a sanity check of the examples and screen the
examples in the base in the light of the prior knowledge, so this means that if you know
proprietary existing strong theory we can validate the soundness of our examples in the light of

8
that theory and thereby we could still utilize induction from these new examples even if there are
few. There are few other things that will be talked about in this week we will talk little about
learning by analogy which essentially we have two domains and we have a knowledge base or a
theory in one domain and we have a much weaker theory in another domain but we can assume
that there are similarities between two domain so therefore we can infer abstractions in the
second domain from the already existing one in the first. Finally there is also a need in many
kinds of system to explicitly and conveniently directly add new abstractions not having to
reprogram the whole system, but in this case no induction takes place its just that new knowledge
pieces are easily added to the system that was referred to by the term being told. And finally we
have reinforcement learning where we essentially also have a prior theory for example a prime
theory for guiding the moment of a robot but then we want to optimize the functionality of that
example that example of robot by incremental learning. So week six and this course covers
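Reinforcement learning is treated in depth later in that week; as a small foretaste, here is an illustrative sketch of the tabular Q-learning update rule on an invented two-state toy problem. The names, the environment dynamics and the parameter values are all assumptions made for this example, not taken from the course.

```python
import random

# Tabular Q-learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
# Invented toy environment: 2 states, 2 actions; action 1 in state 0 pays off.
ALPHA, GAMMA = 0.5, 0.9
Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}

def step(state, action):
    """Invented environment dynamics: returns (reward, next_state)."""
    return (1.0, 1) if (state, action) == (0, 1) else (0.0, 0)

state = 0
for _ in range(100):
    action = random.choice((0, 1))  # explore by acting randomly
    reward, next_state = step(state, action)
    best_next = max(Q[(next_state, a)] for a in (0, 1))
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
    state = next_state

print(Q)  # feedback has raised Q[(0, 1)], the preference for the good action
```

The repeated feedback on good and bad movements, as in the robot example above, is exactly what the reward term in the update provides.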
Week six of this course covers machine learning algorithms in artificial neural networks, and there is a history to this, so one can say that the structure of the lectures is based upon this history. We start by talking about something called the perceptron, which was a very early attempt at an artificial neural network. People were very optimistic already in the 1960s about the successes one could have using this technique; however, there was a backlash when it was shown that the original designs weren't really possible to go forward with, and then one can say there was a long period where not much happened. But around 1986, like 20 years later, this area revived again, and there were some key results in the early 1980s which showed that networks with many layers - the perceptron typically had just a single layer initially - together with special kinds of techniques for propagating values forward and backwards in these networks, could show successful results. So this happened in the 1980s, and we will talk about what happened in that period. The next step in this process is that, in order to really come forward and solve real problems with this technique, you cannot have entirely general networks; you have to adapt the networks so that they fit certain kinds of issues and certain kinds of problems. Image recognition has been one of the key application areas, and there are special techniques that we will cover; an example is convolutional networks, which have specific properties that make them suitable for handling problems in image recognition settings. Also, the initial networks were not ideal for modeling temporal phenomena, like sequences of events, and in order to handle that another kind of network type appeared, called recurrent networks. These two kinds of networks have had a large importance for the development, and we will also look further at more networks in the recurrent family; examples of those are Hopfield networks and Boltzmann machines. Since the 90s there has been a continuous development of all this, but I would say the next very strong revival of this area came in the early 2000s, when the area blossomed, and essentially the term used today for the state of the art of this technology is deep learning. The term deep learning was coined a long time ago, and the reason for the name is that the successful neural networks need to have many layers, they have to be deep in a sense; but today this term covers a lot more, because it encompasses many of the recent advances on the algorithmic side. I would say that one very important characteristic of what you can do today in deep learning is to take the kind of last step in the learning problem, which is to learn representations: not only learn within a fixed model, but really incrementally modify and develop the model itself.
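Week six covers these networks in detail; as a small hedged preview of the single-layer perceptron mentioned above, here is the classical perceptron learning rule applied to an invented, linearly separable toy problem (the logical AND function).

```python
# Classical single-layer perceptron rule on the AND function (toy example).
# Weights are nudged whenever the thresholded output disagrees with the target.
def train_perceptron(samples, epochs=10, learning_rate=0.1):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            output = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            error = target - output
            w[0] += learning_rate * error * x1
            w[1] += learning_rate * error * x2
            b += learning_rate * error
    return w, b

and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(and_samples)
for (x1, x2), target in and_samples:
    print((x1, x2), "->", 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0)
```

A single layer like this can only separate linearly separable classes, which is precisely the limitation that stalled the field until multi-layer networks with backpropagation arrived.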
Week seven is, one can say, a separate part of this course. The reason for it is to really show you that machine learning advances have not happened in isolation, as a pure part of computer science; many times the designs have been inspired by insights fetched from other areas, which one can characterize as the cognitive sciences, and parts of cognitive science are of course cognitive psychology, neuroscience, biology, linguistics and so on. During this week we will primarily look at three things. We will look at some models in cognitive psychology that have to do with how we as humans learn categories and conceptualize the world, but also models from cognitive psychology that describe how we as humans solve problems and how we develop expertise; that's one thing. Another thing we will look at is some models from neuroscience, where we will primarily look into how human perception works. And the third thing we will look at during this week is the theory of evolution, because on a large scale, and on an abstract level, evolution is a learning process. So these are the things we will look into in week seven.
Finally, for week number eight, we will talk about tools and resources useful for doing machine learning work. When an area is new, there are mostly no special tools or support systems available, but when an area comes into the limelight, as machine learning has today, suddenly there is a great interest from all kinds of tool and language providers to see to it that the area is supported by their tool or language. So today there is an explosion of initiatives of various kinds to support work in machine learning, and of course this has also influenced the success of the area lately. The goal of this week is to try to sort out for you what kinds of tools there are, because it is a kind of jungle today, and not so easy to understand what is what. In this week we will look at four categories of things. The first category is what I call general technical computing tools, and these are general toolboxes that have existed for a long time. A typical example of such a toolbox is Mathematica, which traditionally supported a lot of things but didn't support machine learning; today it does. This is one example of such a tool, but of course there are others in the same category, and we will go into that. The next category is termed general software support, which means all kinds of languages, support systems, libraries and so on that can help you work in this area. To take a very simple example: a language like Python is very popular today, first of all because it's a very commonly used and popular language for programming, but also because efforts have been made so that one can easily integrate functionality in that language that facilitates machine learning work. The third category is more dedicated platforms that have been developed specifically for this purpose, the purpose of supporting machine learning. There are a lot of companies who develop those; one well-known example is Google, which has developed a platform called TensorFlow that is used in many cases, but there is really a large set of similar platforms. Finally, something which is very important is that there are easily accessible and openly available datasets on which you can test your systems and algorithms, and a large set of such datasets is now starting to emerge. A well-known example is ImageNet, which supplies a large number of images for all kinds of common nouns.
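As a small illustration of the point about languages like Python integrating machine learning functionality, here is a hedged example using the widely available scikit-learn library and its bundled digits dataset (a toy stand-in for large datasets like ImageNet); the choice of library and model is mine, not a recommendation from the course.

```python
# A few lines of Python with scikit-learn: load a bundled image dataset,
# train a classifier, and evaluate it. Purely illustrative.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)  # 8x8 grey-scale images of digits 0-9
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = KNeighborsClassifier().fit(X_train, y_train)
print("test accuracy:", round(model.score(X_test, y_test), 3))
```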

10
to say a few things about literature, so first of all I want to say that this course has no dedicated
course literature, not one single book that that we follow. There are so many books in machine
learning and as you have understood hopefully now the area is pretty old, so there are very old
books there are very new books and there are very the books that are like in-between. I would
not recommend you at this point to read the very old books those from 1980’s and so. I would
recommend the new ones because some of them are very good, some of them are always
bestsellers at least within the context of an academic era like this, but of course the new ones you
probably have to buy because you may maybe significant evil burrow of them is absolutely not
possible to access them on the way but for the books in between and then I given you an example
here the book machine learning by Tom Mitchell it was originally published in1997. But there
are also later editions of course but for this kind of book older editions may be accessible of what
because they are not bestsellers anymore so there are no interest to the to the publishers. So for
me maybe like a default book that could be useful you want to read more about the area is this
book machine learning by Tom Mitchell. Newer books I also given you two examples here there
is a book from 2014 which is a general introduction by Ethem Alpaydin but there also a book
where much talked about to her this deep learning book by Yoshua Bengio and his associates.
For this course on every week you will get a number of literal references but this literal reference
will typically not be books and they will typically not be survey articles either. So maybe a
special feature of this course is when I give you the literal references in the various weeks I will
mostly give you reference to original articles, and or articles that represent the kind of
breakthrough in this area at various periods and various point in time. So this is the end of the
introduction thanks for your attention, the topic of the next lecture will be the foundation of
artificial intelligence and machine learning.

Machine Learning (ML)
Prof. Carl Gustaf Jansson
Prof. Henrik Boström
Prof. Fredrik Kilander
Department of Computer Science and Engineering
KTH Royal Institute of Technology, Sweden

Lecture 2
Foundation of Artificial Intelligence and Machine Learning

Welcome to this second lecture of the first week of the machine learning course. The theme for this lecture is the foundation of artificial intelligence and machine learning. The purpose of this lecture is, not just by a single phrase or sentence but through a more substantial narrative, to convince you that artificial intelligence is not a new hype: it's an old, established subarea of computer science which has a long history, and from the beginning, when this area was founded, machine learning already formed an integral part of it. Artificial intelligence as a research area has 62-year-old roots. It was established already in 1956, when a small group of researchers gathered at Dartmouth College in New Hampshire, US. It was just a summer workshop of six to eight weeks that year, and as you may infer, this happened only about 10 years after the advent of the first computer. So why did this workshop take place at this point in time, such a short time after the birth of computer science in general? Part of the answer could be that the bulk of the work in the early days of computer science became focused on very low-level matters, where one tried to apply the new computer tools to very standardized tasks. In contrast to that, a lot of the people who were instrumental in the development of computer science had other ambitions and other visions already from the beginning, and they probably felt very strongly that what they really intended with the new computational devices didn't really happen: not the exciting things they wanted to happen, but rather the application of the new computers to very much day-to-day use for standardized things. So probably, after this decade of preliminary work, they thought that maybe it was time to raise the flag again, to raise the flag for the original ambitions for the use of computers, and essentially for the idea that the potential of computers is to do very advanced things, and that this should be put up very high on the agenda again. At this workshop the group of researchers who gathered there managed to agree on a defining statement for the area of artificial intelligence, and that statement I will go through together with you, because I think it's an important statement and a lot of what is said there still holds. The statement starts: "the study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it". One thing in this first part of the statement that is good to observe, in a course on machine learning like this, is that the first example given of the kinds of features of intelligence that one wants to mimic and simulate is the aspect of learning. The statement continues: "an attempt will be made to find how to make machines use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves". Similarly, for this course it's interesting to observe that there are two items on this list that relate to learning: number two, to form abstractions and concepts, and number four, to improve themselves. So even in the second paragraph of the statement, learning takes a major role. The statement ends with a kind of funny paragraph, which goes as follows: "we think that a significant advance can be made in one or more of these problems if a carefully selected group of scientists work on it together for a summer". It's kind of funny, because after 60 years we haven't really made that significant advance; we have made some advances, but the significant advance, I would say, we still wait for. But the ambitions of this group at that point in time were very high. So now let's take a look at the people who met at this summer school.
They have been called the founding fathers of artificial intelligence. One of them, Claude Shannon, is the founder of information and communication theory; he published his seminal work on that in the late 1940s. He is also known, as you may be aware, for having written the most famous master's thesis in the history of computer science, which was a precursor to his later publications in information and communication theory. D.M. MacKay was a well-known British researcher who already at this point worked on the borderline between information theory and cognitive science. Julian Bigelow was the chief engineer for the famous computer at Princeton in 1946, the IAS computer or, by its more popular name, "the MANIAC". Nathaniel Rochester was the author of the first assembler and a key person in the development of the first commercial computer, the IBM 701. Oliver Selfridge was named the father of machine perception for his very early work on trying to automate processes similar to the way the human vision system works. Ray Solomonoff was the inventor of algorithmic probability and one of the key persons who very early understood the importance not only of trying to create practical learning systems, but also of understanding the theoretical limits and restrictions of these processes. John Holland was the inventor of genetic algorithms. Marvin Minsky was one of the key MIT researchers, the founder of the early MIT AI lab, and very influential in the early development of AI. Allen Newell was the champion of symbolic AI and an inventor of many central AI techniques, together with his colleague Herbert Simon, who was a pioneer in decision-making theory and a Nobel prize winner in economics. Finally, John McCarthy was the founder of the Stanford AI lab and the inventor of the LISP programming language.
The Dartmouth College summer school was not the only thing of relevance for artificial intelligence that happened in the 1950s. A lot of early work occurred already in this decade, and I will first mention a few general things that were important for the development of artificial intelligence as a field. Unfortunately, one of the key persons from the computer science era was not present at Dartmouth College: Alan Turing. But already in 1950 Alan Turing had published a paper called "Computing Machinery and Intelligence", and the program proposed in that paper has been of very big importance for the development of artificial intelligence; also, in that paper automation of learning was given a prominent role. Another result that came very early, even before the Dartmouth conference, was the work at Carnegie Mellon by Allen Newell and Herbert Simon on what is called the "Logic Theorist". The Logic Theorist is a computer program that tries to mimic the problem-solving skills of a human being; actually, it has been called the first artificial intelligence program ever. The purpose of that program was to be able to prove theorems in Whitehead and Russell's Principia Mathematica, and this program, even though some of the techniques weren't even named at that point, introduced techniques that later had a big importance for the area, such as list processing, means-ends analysis and heuristic search. So this was a key piece of research coming very early. Later in the 50s, Simon and Newell continued that work and published results on what they called the "General Problem Solver", a computer program intended to work as a universal problem-solving machine, which of course was in some ways weaker than the Logic Theorist, because it wasn't dedicated to one specific category of problems, but instead had a much wider range of applicability. Also in the 1950s, work started on defining programming languages which could be specifically useful for developing artificial intelligence systems, and in 1958 John McCarthy created the first version of LISP, which had that purpose; today it is the second-oldest high-level programming language still in use, only Fortran is older, by one year. The final result I want to mention is that Oliver Selfridge, also one of the Dartmouth participants, created a system he called Pandemonium, which was one of the first attempts to create a computational model that can mimic pattern recognition in images, of course inspired by the models at that time of how the human image recognition system works. So what is interesting to observe is that, on top of the early work in artificial intelligence described so far, there is a substantial list of things happening in this first decade that are essential for machine learning.
Maybe first it's appropriate to mention the person who coined the term machine learning: Arthur Samuel, who did that in 1959. Arthur Samuel worked at that point at one of the IBM research labs, and one of his tasks there was to look at computer programs that could play checkers; this is actually one of the first examples of game-playing programs. In the context of this work of producing a program that could play checkers and compete with human players, he obviously observed that he needed learning algorithms, so he included some learning mechanisms in his system, and when he published these results he was the first to use the term "machine learning". Going back a little, actually before the 50s, the real starting point for work on neural networks was as early as 1943, when McCulloch and Pitts published their work on neural networks as a model of computation. This was actually the first attempt to look at neural activity in the brain and try to mimic it in a computational model. So this was very early, but this work was also followed up in the 1950s. People like Marvin Minsky did some work in the mid-1950s on a system called SNARC, which is assumed to be the first real computer implementation of a neural network machine, and not much later Frank Rosenblatt published his work on the Perceptron, which is today considered the starting point of the artificial neural network development. He was working then at a Cornell laboratory, and it was very successful work at first, but it turned out that the way Frank Rosenblatt presented the Perceptron, at least, and its initial applications, had severe limitations, so after that it took a long time before that area really picked up pace again. Not much later there was another kind of work focused on sub-symbolic representations: John Holland introduced, in 1960, his first work on genetic algorithms, inspired not by neural activity but by Darwin's theory of evolution, so this was an even bolder step of looking wider for the inspiration for computational models. Finally, also in the mid-1950s, Ray Solomonoff published his first work on machine learning, with a system termed the inductive inference machine, and as I already said, Ray Solomonoff became one of the key persons not only looking at practical applications of machine learning, but also at the theory of the field. So, to sum up this lecture: the first point is that this lecture wanted to convey to you what kind of agenda the area of artificial intelligence had when it started 60 years ago. The second point is that already from the start, machine learning had a key role in the development of this area. The third point is that many of the kinds of machine learning we see today had their roots already at this time: there were several initial contributions to the development of neural networks at this point, there were several contributions to symbolic computation and learning in symbolic representations already at this time, even the kind of machine learning algorithms based on evolutionary theory had their roots at this time, and finally, machine learning theory also originated at this time. So thank you for your attention; the next lecture in this series will continue to give you a picture of artificial intelligence and the role of machine learning.
Thank you

Machine Learning (ML)
Prof. Carl Gustaf Jansson
Prof. Henrik Boström
Prof. Fredrik Kilander
Department of Computer Science and Engineering
KTH Royal Institute of Technology, Sweden

Lecture 3
Intelligent Autonomous Systems and Artificial Intelligence

So this is the third lecture of the first week of the machine learning course, and the topic of this lecture is intelligent autonomous systems and artificial intelligence. When artificial intelligence was founded in 1956, it was not so long ago, less than 10 years, since Thomas Watson, the CEO of IBM, made his famous statement that "probably not more than five computers are needed in the whole world". Of course some time had passed since that day, but we still had a world where computers were rather rare compared to other phenomena, other things. We may have had more than five computers, but we didn't have thousands, millions or billions of computers, and of course much early work in artificial intelligence could not avoid being influenced by this: we talked about computers, and about whether the computers could do stupid things or intelligent things. Today the world has changed, and of course everything has to change with it. Today we don't talk about five computers, we talk about billions of artifacts in the world that can be more or less intelligent. They can be vehicles on our roads, vehicles in terrain, vehicles on the water, drones in the air; they can be all kinds of robots, in our industries but also in our homes, like vacuum cleaners and other machines; and we can also look at swarms of robots that can solve different problems. So we now face a situation where we are surrounded by potentially more and more intelligent autonomous entities, and of course there is now a role for various scientific areas, like artificial intelligence, in trying to make sense of the situation we face. What I want to discuss with you now is an attempt at an abstract view of an intelligent autonomous system. For the moment, let's forget that we have this gigantic system of billions of artifacts, where each one of them can be intelligent; what we can say about the whole system is another problem. Let's stay with looking at one of these entities at a time, which of course is a simplification, but we have to start somewhere.
For any intelligent agent, human or animal or artificial, there are a number of things that must be in place. First of all, as an entity you have to have a model of the world you live in, or act in; this means that you have to have a representation of what surrounds you. The model of the world is crucial; it can have many forms, but it has to be there, because if you don't have any model of the world, how can you perceive, how can you think, how can you act? So the model must be in place. Given a model of the world, you have the first process, which is perception: when you act as an entity in this world, of which you have a model, you have the problem of efficient perception, which means that you have to make measurements or sensory observations, from these sensory observations you have to collect the relevant data in every situation, and from the relevant data you have to create abstractions. Given that you have a perception of the situation at hand, you then need some reflection, some analysis, some reasoning about the situation as a background for actions, because as an entity in the world you have to act; there is no point in perceiving if you don't have an intention of acting. So this cycle of perceiving, thinking and acting is very central to any intelligent being, and all the steps have sub-steps: for acting, you have to plan what to do, when you have a plan you have to configure actions, and when you have set up a good configuration of actions in line with your plans, you have to perform these actions. All these steps, of course, have to be performed in terms of a model of the world. And then, at the bottom, we have learning: for any intelligent being, you cannot just repeat the same behavior time after time. You perceive, you think, you act, but you also perceive again and observe the consequences of your actions, and in order to improve you have to have some learning mechanism that affects your thinking and your actions based on earlier experience. So this is one way of looking at how an intelligent system has to behave.
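To make the perceive-think-act-learn cycle concrete, here is a minimal schematic sketch in Python; every component is an invented placeholder, since the lecture describes the architecture only abstractly.

```python
# Schematic perceive-think-act-learn loop for one autonomous entity.
# All components are invented placeholders that illustrate the abstract cycle.
world_model = {"position": 0}  # the entity's model of the world it acts in

def perceive():  # sensory observations reduced to relevant data
    return {"obstacle_ahead": world_model["position"] >= 3}

def think(percept):  # reflect and plan against the world model
    return "stop" if percept["obstacle_ahead"] else "move_forward"

def act(action):  # perform the configured action
    if action == "move_forward":
        world_model["position"] += 1

def learn(percept, action):  # adjust future behaviour from experience
    pass  # e.g. update action preferences based on observed consequences

for _ in range(5):  # the central cycle: perceive, think, act, learn
    percept = perceive()
    action = think(percept)
    act(action)
    learn(percept, action)
    print(world_model["position"], action)
```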
When we compare computer science and artificial intelligence, what can each of these areas contribute with respect to building artificial intelligent systems? Of course you can do a lot with computer science, but what I want to argue here is that there are some fundamental differences in what characterizes the core methodologies and technologies. The keywords that characterize computer science are keywords like determinism, causality, certainty, completeness, invariance, and quantitative data and knowledge. The idea in computer science is that you can characterize the situation you are in, in a complete way, and then you can create algorithms that in a deterministic and causal fashion derive some consequences. It's an ideal world in a sense, because it presupposes, at least for the particular situation, some complete and certain information in order to arrive at the inference. As we all know, an intelligent autonomous system that lives and acts in the real world cannot always rely on such a situation. This means that for many purposes computer science makes sense, because certain situations are complete and we can use computer science to devise methods that make a system function; however, there are many situations which are uncertain, where we lack complete information, where we cannot rely only on quantitative data but have to use qualitative data as well, where we may need to look for non-deterministic solutions, and where we may also need the behavior of the system to be adaptive and not invariant. So for me these distinctions, given by these keywords, are the crucial difference, and for me this is the basic argument why the artificial intelligence techniques, which set out to handle the non-complete, the uncertain, the non-deterministic and the mixed qualitative and quantitative situations, in the long run have a better prognosis for making useful systems.
If we now turn to how research in artificial intelligence has been performed during the last sixty years, the key division of work is made along three aspects: knowledge representation, automated reasoning and machine learning. Knowledge representation focuses on modelling what is in the world; automated reasoning focuses on how, based on perception, to make the necessary steps to generate adequate actions; and finally machine learning provides the mechanisms that make it possible to improve the reasoning process. So this is one dimension. Then it is obviously so that, for knowledge representation, artificial intelligence has worked with two kinds of representation types, symbolic and sub-symbolic, and this of course also affects the kind of reasoning you can perform: for symbolic representations there is one kind of reasoning, for sub-symbolic another, and likewise for symbolic representations there is one kind of learning technique and for sub-symbolic another. You can see on this slide examples of the specific representations and the specific reasoning and learning techniques needed; the slide shows how work in artificial intelligence can be indexed scientifically, but it also maps reasonably well onto the earlier slide that gave a picture of how an intelligent autonomous system could function. On the earlier slide I tried to describe core artificial intelligence; however, there is a borderline of areas around core artificial intelligence which are in many cases assumed to be included in the concept, and I will call them the second rim here. Examples of such areas are machine perception, computer vision and robotics. Most people would say that of course robotics is artificial intelligence, but it is still a very specific area, with its own specific issues, while sharing the core issues of artificial intelligence. Other such areas are game playing, optimization in technical systems, dedicated data mining, language engineering, speech technology, intelligent interfaces and expert systems. These are all kinds of special systems that in most cases, in the common conception of the area, are counted as part of artificial intelligence, but it makes sense to differentiate between these extensions and the core technologies, even though most of these extensions really share the essential parts of the core.
With all these advances that now happen regarding artificial intelligence, there are of course many theoretical and philosophical issues that come to one's mind. Are there theoretical limits to the intelligence of artificial systems, and in that case, what restrictions? Another issue that of course worries people a lot is a very practical one: will the emergence of intelligent artificial systems dramatically reduce the human workforce? Quite obviously, many of these intelligent artifacts are meant to do tasks that historically have been done by humans, so the existence of intelligent artificial systems will change the workload; whether there will be more jobs or fewer jobs remains to be seen, but it's not so much a technical problem as a societal problem. Another issue is: will the existence of these systems dehumanize human life? As more and more things happen without human intervention, will our lives get less human because we interact more with artificial agents? On the next level, one can also start to think about whether there are intentions coupled to these kinds of systems, or whether they just do simple tasks: could one regard them as benevolent or malevolent, and does it make sense to attribute ethics or morals to artificial systems? This relates very closely to whether it is meaningful to talk about these kinds of systems having intentions, having consciousness, having a mind. And of course there is the ultimate question that has been thought about: is it so that at some point these systems will entirely take over, so that the existence of these systems is an existential threat to the human race, what is called the singularity? I don't think that at the present moment there is any clear answer to any of these questions. For the moment, many of the applications are pretty simple, and we are pretty much at the beginning of this development, so many of these worries are not very relevant initially; but the development will go on, and after a while many of these issues will come up and become more and more crucial over time. So thank you for your attention; the next lecture will be on applications of machine learning.

Machine Learning (ML)
Prof. Carl Gustaf Jansson
Prof. Henrik Boström
Prof. Fredrik Kilander
Department of Computer Science and Engineering
KTH Royal Institute of Technology, Sweden

Lecture 4
Applications of Machine Learning

So, this is the fourth lecture of the first week of the Machine Learning course. The topic is Applications of Machine Learning. This topic is more or less independent of many other topics of the course. It's placed early in the course, in this position, because for many techniques we are going to talk about we will refer to applications, and therefore I think it a good idea to introduce an overview of applications, for you to have something to refer to.

So, when you open a newspaper today, or turn on local television, it's not unlikely that you will see ads, or other programs or articles, which announce success stories of Artificial Intelligence applications. The question is how you can make sense of this flood of new inventions that claim that all these products are becoming better because they applied Artificial Intelligence. I saw recently announcements of new kinds of AI-focused TV brands. All the major products within the smartphone area, by companies like Huawei, Samsung and Qualcomm, are launched as AI-powered smartphones. Even Burger King boasts that their ads are AI-written. Some of these announcements are pretty serious actually, but some of them are just crap. What one can say, at least with some certainty, is that a majority of the real stories - the real Artificial Intelligence success stories that you see in the media - relate to the application of Machine Learning, more or less only. So even if an ad claims that it's an application of Artificial Intelligence, it's not actually an application of Artificial Intelligence in the broad sense, but rather, as we have described earlier this week, a pretty narrow application of Machine Learning techniques, which still might be very useful. In order to understand what we actually are doing, I think this is a very important fact. The second pretty certain fact is that, out of the many Machine Learning success stories announced today, a majority relate to Image Recognition or Speech Recognition. There may be a lot of different kinds of applications which have been improved, but the source of the improvement is that the Image Recognition, or the Speech Recognition, in these systems has become better. This is just to give you a sanity check of what is probably going on: when you see these ads, some systems announced may very broadly use Machine Learning, or very broadly Artificial Intelligence, but that is a minority of the systems.

20
So, in order to give you some aid in analyzing what is really going on with respect to applications, what I've tried to do on the current slide is to separate what I call general application sectors from what I call specific categories of Data Analysis. The general application sectors are, I would say, trivial or straightforward. They are well recognized: medical diagnosis, personalized treatments, drug design, driverless vehicles and household robots, personal assistants and recommender systems and navigators, adaptation of communication and social media services, marketing and sales, optimization of technical processes, monitoring and surveillance, financial services, cyber security, and machine translation. All these are rather common-sense application sectors, and when you look at the announcements of progress, these always relate to some of them. What is a little more innovative on this slide is the introduction of what I call specific categories of Data Analysis, because these categories represent some kind of middle level of analysis - which means that you do not only say that you make this really great medical diagnosis system using Machine Learning, but instead you say "Ok, there is this medical diagnosis system using Image Recognition, or using Data Mining on large datasets, or using Text Mining", and then indirectly you say that in the image recognition part of the system you use Machine Learning, because that is what is actually the truth. So what I claim here is that it has more explanatory power to introduce these two levels and explain the relation between Machine Learning and the real application sectors in two steps. And the five categories I introduce here within Data Analysis are: Image Recognition, which I see as the foremost category for the moment, given everything that has happened in that area; Speech Recognition, which is also very important; Data Mining for large datasets, which is the third; Text Mining of large document collections, which is the fourth; and Dynamic Adaptation of technical systems, which is the fifth. In the coming slides I will try to say a few words about each of these categories.

So, starting with Image Recognition, or Computer Vision, which is another word for it. Image Recognition or Computer Vision deals with how computers can be made to gain high-level understanding from digital images or videos. The area seeks to automate tasks that the human visual system is assumed to solve; we could say that we want to mimic the human visual system here - not entirely, but partly. Image Recognition is in my view currently the most successful application domain for Machine Learning techniques. Many of the real success stories we hear about are coming from this area. Recent advances in medical diagnosis, for example for different kinds of cancers, and the key components in self-driving vehicles, are primarily due to the progress in Image Recognition. If we look at some sub-domains: it is important to detect objects, and it is important to detect events in images. We also have the issue of not only static objects but also moving objects - so actually tracking videos and doing motion analysis. Another part is not just looking at objects, static or moving, but looking at whole scenes, including many objects, moving or not. A fourth issue, which is a growing area, is how you can work with images - how you can restore a partially damaged or bad image to a state where it is useful for different purposes. And of course there are many phases in this kind of system: you have to have a good system for acquiring images in the first place and for pre-processing them. A very important part is Feature Extraction, because an object in an image may have thousands or millions of features, and it does not make sense to base the analysis or learning on all the possible features, so extraction or selection is needed. Then comes the phase of detection. Maybe you will have to divide the image into different parts, into different segments, in order to simplify the work. Then comes high-level processing, and finally the stage of decision making, based on what you have analyzed and found in the image.
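To make these phases a little more concrete, below is a minimal sketch in Python of the whole chain - acquisition, pre-processing, feature extraction and classification - using scikit-learn's small built-in digits dataset as a stand-in for real image acquisition. The choice of a support vector classifier and the scaling step are my own illustrative assumptions, not the only options for a real image recognition system.

    # Minimal sketch of the image recognition phases, assuming scikit-learn
    # is available; the digits dataset stands in for image acquisition.
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    digits = load_digits()                                # acquisition: 8x8 images
    X = digits.images.reshape(len(digits.images), -1)     # extraction: pixels -> feature vector
    y = digits.target

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    scaler = StandardScaler().fit(X_train)                # pre-processing: normalization
    clf = SVC().fit(scaler.transform(X_train), y_train)   # detection/classification
    print(clf.score(scaler.transform(X_test), y_test))    # decision making: evaluate

A real system would of course replace the naive pixels-to-vector step with much more elaborate feature extraction, but the division into phases is the same.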

So, Speech Recognition enables the recognition and translation of spoken language into computer-readable text. In speech recognition, both Acoustic Modeling and Language Modeling (or Language Engineering) are important parts of the modern statistically-based Speech Recognition systems. Speech Recognition is in my opinion currently the second most successful application domain for Machine Learning techniques today, after Image Recognition. There are a lot of advances reported on Personal Assistants today. Machine Translation systems are also getting much more usable, and for various communication systems there are speech interfaces that are becoming more and more used and accepted. I would say that in most of these cases it is not so much the underlying reasoning about the domains, and not so much true language understanding, but rather the improved Speech Recognition that makes these systems more accepted and useful today. There are two kinds of recognition systems: on one hand you have Speaker Independent systems, which can be used by any person who happens to use them; on the other hand you have Speaker Dependent systems, where you actually train the system on a specific person's voice - you tailor the system and fine-tune the recognition for that specific person. Also related to that, of course, is the technology that can be used for identifying speakers.

So the third category is Data Mining, or synonymously Data Analytics, for Large Data Sets. Data Mining is a process that is used to extract actionable data patterns from large data sets. "Actionable" is a pretty important word here, because you do not do these analyses for fun; you do them with a purpose, and most of the time the motivation for doing data mining is business-oriented: to support a wide range of business decisions, ranging from operational to strategic. Data Mining is primarily applied in weak-theory domains to discover new patterns, and it is related to other areas like Data Discovery, Knowledge Discovery, Business Intelligence and Data Warehouses; these terms or concepts are more or less synonyms. The data sets analyzed in Data Mining may be user input to e-commerce systems - a very important source today - or user input to personal assistants, recommender systems and social media. It can be data collected from industrial and production processes, or data collected from online learning systems, for which there is a specific term: Learning Analytics. But it can also be large factual, financial and statistical databases. For speech and image recognition, which we discussed earlier, artificial neural network techniques are at the moment preferred and have had many successes. But for Data Mining, a variety of different Machine Learning techniques still apply, and there is no such clear bias towards one technique as the most successful.

So the fourth category is Text Mining, or Text Analytics, for Large Document Collections. We still talk about large sets of data, but where the last category, Data Mining, mostly referred to structured data, here we have ordinary texts. The goal of Text Mining or Text Analytics is to explore and analyze large amounts of unstructured text data, aided by software that can identify concepts, patterns, topics, keywords and other attributes in the text data. So Text Mining needs a combination of Natural Language Engineering techniques and Data Science or Machine Learning techniques. Typical document collections targeted by Text Mining are business-related documents, scientific publications, governmental reports and news streams. Multiple-language issues are a key thing here, because when you analyze a topic you cannot just say that you will look at articles in English, or in Chinese, or in Japanese; you have to handle multiple languages, and therefore translation issues are crucial in text mining. Many applications need incremental Text Mining of new documents to augment knowledge-based systems or expert systems. For example, expert systems in medicine or in law get outdated after a while; new information arrives all the time in document form, and in order to keep the systems up to date you actually need to extract the relevant new facts from the new documents and augment the systems based on those.
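To give a small taste of what identifying keywords can mean in practice, here is a minimal sketch using TF-IDF weighting with scikit-learn; the three toy documents are invented for illustration, and a real text mining pipeline would add language handling, parsing and much more.

    # Minimal TF-IDF keyword extraction sketch; the documents are invented.
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "new clinical trial reports improved treatment of breast cancer",
        "central bank raises interest rates amid inflation concerns",
        "robot arm control improved by reinforcement learning",
    ]
    vec = TfidfVectorizer(stop_words="english")
    tfidf = vec.fit_transform(docs)
    terms = vec.get_feature_names_out()

    for i, doc in enumerate(docs):
        weights = tfidf[i].toarray().ravel()
        top = weights.argsort()[-3:][::-1]        # three highest-weighted terms
        print(i, [terms[j] for j in top])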

So the last category of data analysis I call Adaptive Control of Technical Systems. It is actually a different species of technology we are talking about here, because Adaptive Control is the control method used by a controller that must adapt to the controlled system. The parameters of the system are typically uncertain at the beginning of the process, and the purpose is, during the process, to set values for these parameters so as to acquire an optimal solution. Analytic Adaptive Control methods are of course focused on truly optimal solutions and on algorithms for the exact computation of those solutions. What machine learning can contribute here are methods that can function and create good solutions even if optimality cannot be guaranteed, and the most relevant Machine Learning technique in this context is Reinforcement Learning, which can actually function in the absence of a mathematical model. One example of this is robot learning, which studies how, for example, a robot can acquire new skills or adapt to an environment during learning. So Reinforcement Learning really looks into how software and hardware agents take actions in an environment so as to maximize some measure of performance. It is of course necessary that the results of earlier actions are systematically logged or documented, so that these outcomes can be fed back into the system to allocate some kind of credit or blame to earlier actions, modifying the rules of the system and thereby optimizing its future performance.

So, having presented these five categories of Data Analysis, which are all dependent on Machine Learning or Data Science and statistics, I now want to give one example to advocate that this way of analyzing things has explanatory value. Just look at medicine and applications in medicine: what I do on this slide is try to exemplify that in medicine you can have meaningful applications of all five categories. You can have one application in medicine where you actually use image analysis for the diagnosis of breast cancer; that is one thing you can do, and you can have great success in doing it. The second example in medicine concerns a big issue - how to efficiently file medical records in clinical work - and a really good speech recognition system that helps doctors and nurses file medical information in the records in an efficient way is a great help; that is another kind of application. The third application is to use Data Mining to explore the large clinical databases that exist, to find patterns; clinical databases are structured data, not images, so another form of analysis is needed. The fourth part is mining text, because in the medical area a huge number of new articles are published all the time, and there is great pressure on professionals in medicine to follow the advances through these publications. So a system that helps the professionals explore all the new information in these publications, by doing text mining on them, is also a very useful system; and of course such findings could also be used to update medical expert systems as an indirect effect. For the fifth area: as you know, today there are many experiments with surgical robots, and of course surgical robots also have to be trained to make optimal or perfect movements, so the training of robot movements for surgical robots is clearly a case of the fifth category. With this slide I want to illustrate that it has some explanatory value not to just say "oh, we have this fantastic application in medicine and it uses Machine Learning". It makes more sense to first analyze what we are doing in terms of one of these five categories, and then in the second stage say "the image analysis is done by using Machine Learning".

So this was the end of this lecture. I thank you for your attention, and the next lecture will be on
the Tutorial regarding the assignments for this week.

Machine Learning,ML
Prof. Carl Gustaf Jansson
Prof. Henrik Boström
Prof. Fredrik Kilander
Department of Computer Science and Engineering
KTH Royal Institute of Technology, Sweden

Lecture 5
Tutorial for week01

This is the last lecture of the first week of the machine learning course. As I hope you have understood, the last lecture of each week will always be a tutorial: an introduction to the assignments of that specific week.

Assignments can generally be of two kinds: they can be recall-related questions, referring to facts that have been presented in the lectures, or they can have more of a problem-solving character. In coming weeks we will see more problem-solving tasks in the assignments, but for this first week the assignments will be more of the recall character. For the assignments on this course, every week there will be two categories: one category will be questions relating to what has been directly described or shown in the lectures; the other category relates to the extra readings that are proposed for each week. This week most of the questions are recall questions and there is no need for further instructions regarding the first category, so please just take a look and do your best.

More interesting and important, though, is to say a little about the extra readings required for this week, which are crucial for being able to handle the second group of assignments. As I have said earlier, the ambition on this course is to systematically give readings that refer to what I would characterize as original work, not second-order work. So for this week I have chosen five articles, which I hope you can retrieve without problem through the given links. The first article is the seminal publication by Warren McCulloch and Walter Pitts from 1943, in which they published the first description of "A Logical Calculus of the Ideas Immanent in Nervous Activity"; this is by most people considered the starting point for all work on artificial neural systems. The second article, which is just as important, is the key publication by Alan Turing from 1950, named "Computing Machinery and Intelligence", where Turing explicates his thoughts about computers and their ability for intelligent behavior, including learning. The third document is John McCarthy's original proposal for carrying out the Dartmouth summer research project on artificial intelligence, described at length in an earlier lecture. The fourth article is one of the key publications by Arthur Samuel from 1959, where he describes his work on automating the game of checkers and also introduces machine learning as a term, referring to the learning mechanism he felt he needed for his game-playing program. Finally, the last publication is the first publication by Frank Rosenblatt, from 1957, about the perceptron, which is today considered the first instance of an artificial neural network system. Looking at the questions for the second part of the assignment, I guess you may feel that this kind of recall question is a little silly, but you must understand that the questions are there for an important reason: I want to push you to read these important original publications, which were crucial for the development of this area. So this was the end of week one of the machine learning course. I want to thank you for your attention; the theme for the next week will be "characterization of learning problems".

Machine Learning,ML
Prof. Carl Gustaf Jansson
Prof. Henrik Boström
Prof. Fredrik Kilander
Department of Computer Science and Engineering
KTH Royal Institute of Technology, Sweden

Lecture 6
Characterization of Learning Problems

Welcome to the second week of the machine learning course. The theme for this week is characterization of learning problems. As you know, this course runs over many weeks, and we will gradually become more concrete regarding various algorithms and the application of those algorithms to specific problems. This week, however, has the purpose of giving you a general overview of machine learning and its varieties, of various situations and scenarios for the use of machine learning, and of crucial issues to consider in those situations.

So we will start by discussing one major distinction regarding the role of machine learning. A lot of this course will be focused on what is called data analysis, which means analyzing large data sets to create abstractions; these abstractions can then be used for various purposes. The second context is what is termed here adaptive systems: as you know, we are surrounded by more and more autonomous systems of various kinds that provide services, and in general these autonomous systems need to be adaptive in order to be useful. Of course, machine learning has a growing role here too. The distinction between data analysis as such and the use of machine learning in adaptive systems is still a useful distinction; it is still very relevant for sorting out and understanding various approaches to machine learning. But of course the borderline between the two tends to become increasingly blurred: ten or twenty years ago there was a clear demarcation line between the areas, but that is not so today. Of course, in general, the major reason for all learning is to ultimately act better in future situations.

So let us first say something about machine learning in the context of adaptive systems. There is one major paradigm in machine learning called reinforcement learning; you may have heard about it, and I am sure you will hear more about it. Examples of the use of reinforcement learning are found in the context of game playing, but also in the context of robots - that is, how to adapt robots so that the movements of the robot become more optimal. There is a pretty abstract model of reinforcement learning, in the sense that it is viewed as an agent situated in an environment: the agent performs a series of actions, and the environment is supposed to react and give feedback based on those actions. We term this kind of feedback the reward, and in the long run the idea is that the series of actions, and the rewards given from the environment to the agent, iteratively improve the functionality of the agent and eventually create some kind of optimal performance. Reward is of course a general word; it can be negative or positive, so we can say that reward can be credit or blame. In any concrete application of this abstract model it is assumed both that the environment in question is able to provide reward in a concrete form, and that the agent is able to internalize this concrete reward and adapt the internal behavior of the system. Typically this kind of system has a number of internal parameters, and what happens when the reward is considered is that these parameters are modified, but in general it could be any change of the internal structure of the agent that takes place in this context. To continue a little about reinforcement learning: as I just said, it is absolutely important that the agent can properly manage the handling of the reward and, as a consequence, update its own internal structure. It is further assumed that the typical scenario for reinforcement learning is a strong-theory-based hardware and software system. We are not talking here about a system that learns in the absence of a theory; we rather learn in the presence of a very strong theory. Just consider a robot: a robot is normally a very advanced design, and if we use reinforcement learning just to optimize the movement of an arm, we can say that this kind of robot system learns on the margin of its already existing behavior. We will look into different ways of formalizing this later in the course, but already now one can say that one very commonly used way of modeling the environment in this kind of case is what is called a Markov decision process, together with a very generally applicable mathematical technique called dynamic programming. These are very typical tools in the toolbox of the engineer in this respect, but as you will see later, reinforcement learning can be realized in many different shapes.
So now I want to turn to data analysis, and the first thing I want to do is to discuss with you a little what I call the end-to-end process for the application of machine learning to real-world problems. As you see on the slide, there is a long list of steps to consider, and rather far down the line in this list you find the core analysis phase. I assume that when we think about machine learning and its application, many of us think about this core analysis phase, where we have an algorithm, we apply that algorithm to a dataset, and we get some output from that process. But as you see from the whole list, this is absolutely not the only thing; there are many other steps needed in order to solve real problems. The first thing is of course to harvest the data, and for a realistic application it is absolutely not certain what the data are or what form they have; data can come in many forms from very many sources, and potentially there is a lot of work to harvest the data from all these sources and bring it together. And because the forms are very different, some pre-processing of this data is needed: a typical case is when the input data is in image form, or just sound or something else; then it has to be transformed into a digital form that can reasonably be used by a machine learning algorithm. One must also consider whether we are learning more or less in the absence of a theory - we have a lot of data and we really do not have a theory - or, in contrast, whether we have a very strong theory, so that, as I said earlier for reinforcement learning, we learn on the margin of prior knowledge. Exactly where we put the border between machine learning and other engineering skills is a little unclear, but I assume that when we come to feature engineering we are clearly within the realm of machine learning. Feature engineering - which we will come back to - comprises the skills needed to express the data items, or the data set, in a form that is manageable for the algorithm. So, once we have devised and fabricated the right kind of data set, we have to choose an algorithm that is suitable for the situation. But normally it is not just a matter of choosing an algorithm, because an algorithm is very seldom a black box; an algorithm can be adapted, and probably should be adapted, in order to give good performance. Typically one has to adjust what are called the hyper-parameters of the algorithm, and handle language bias and complexity management issues. When all of this is done, the core analysis can take place. But after that phase there are normally more things to do, because it is not certain that the output from the application of the algorithm is in the form that you require, so some post-processing is needed. Finally, in order for the result to be used: either it is used directly in a system, with online updating, and then you need to prepare the form of the output; or you want to use it for decision making, and then you need to prepare material that is useful for decision-makers.
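As a compressed illustration of the middle of this chain, here is a minimal sketch where pre-processing, feature selection and the core analysis phase are chained together, with one hyper-parameter set explicitly. The dataset and all parameter values are illustrative assumptions, not recommendations.

    # Sketch of pre-processing + feature selection + core analysis chained
    # into one pipeline; dataset and parameters are illustrative choices.
    from sklearn.datasets import load_breast_cancer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.feature_selection import SelectKBest
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)     # harvested, already digital data

    pipe = Pipeline([
        ("scale", StandardScaler()),               # pre-processing
        ("select", SelectKBest(k=10)),             # feature engineering/selection
        ("model", LogisticRegression(C=1.0, max_iter=1000)),  # C is a hyper-parameter
    ])
    print(cross_val_score(pipe, X, y, cv=5).mean())  # evaluate the end result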

So what we will do now is slowly converge on the key topic of this week, which is classification. One can say that there are two main scenarios in data analysis. One important area, which we will actually not spend so much time on, is what in statistics is called regression. Regression is essentially to use the analysis of old states to establish a means of prognosis for future states. In contrast to that, classification is to look at descriptions of objects and abstract from those objects, trying to define concepts or classes, so that the definitions of those concepts can serve as a basis for classifying new objects in the future. So regression is about prognosis for states, and classification is about the basis for being able to classify objects not seen so far. To say a few words more about regression: it is the technique from statistics used to predict values of a target quantity, where the target quantity is typically continuous. To the right you see an example of how it can look - a very simple example, which is a linear dependency between two numerical variables. There are millions of examples of that situation: the numerical values could be the height of a person as a function of the person's age, or the price of a property as a function of a point in time, and so on. But as you understand, this is just the simplest kind of example; in many cases the state can be something more complex - it can be a bundle of variables, or some structured set of variables - and regression of course makes sense as a concept also for those more complex cases.
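As a minimal sketch of the simplest case, here is a least-squares line fitted to a handful of invented property prices, used to make a prognosis for a future point in time.

    # Fitting a linear dependency between two numerical variables;
    # the price data points are invented.
    import numpy as np

    year = np.array([2010, 2012, 2014, 2016, 2018], dtype=float)
    price = np.array([100, 115, 133, 150, 170], dtype=float)

    slope, intercept = np.polyfit(year, price, deg=1)   # least-squares line
    print(slope * 2020 + intercept)                     # prognosis for a future state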
From now on we will focus mostly on classification or, as we will synonymously call it, concept learning. We will have four more lectures this week: three lectures on the key issues and a fourth lecture with the tutorial for the assignments. The first of the three lectures will be on objects, categories and features, primarily to sort out the terminology - because it is a pretty complicated situation, and part of the difficulty of reading about this area is that there are so many terms around, often synonymous. In the second lecture we will talk about features and the role of feature engineering, and finally, in lecture 2.4, we will turn to what is here called scenarios for learning, where we will discuss a number of important distinctions, such as supervised versus unsupervised learning, online versus offline learning, instance-based versus abstraction-based learning, and so on. But before we leave this lecture I am going to introduce a simple example that will be referred to in this lecture series and maybe in forthcoming ones; over the weeks we will introduce a few examples that will be the basis for various discussions. As you may remember from the first week, there exists at the moment a growing set of repositories for datasets of different kinds; many of them are open access, some are more restricted, and so on. What I have done here is to select a reasonably limited dataset from a research repository, the UCI ML repository; the name of the dataset is the Zoo dataset. If you look at the whole repository, it has 351 datasets, some with a huge number of data items. My selection here was purely practical: I needed a simple example to use as a basis for some discussions. This is a very naive and partial classification of animals. There are 101 objects, or data items, and each is characterized by 18 features - 18 is not so small, but not so large either. Essentially, these 101 objects are pre-categorized, labeled with one of seven categories; that is the example we will look at. So for this Zoo dataset we have a class structure, or category structure, which essentially focuses on the classification of animals at a certain level. The whole dataset starts from single examples of specific kinds of animals, and the task is to classify those in terms of seven categories, which are all animal classes: mammal, bird, reptile, fish, amphibian, insect and invertebrate. That is how the dataset is set up, so every single data item in the dataset is labeled with one of these seven categories.
Next we come to the features in this dataset. As you will soon see, every single data item is characterized by all of these features, and most of the features are Boolean - they are either 0 or 1: an animal has hair or it has no hair, it has feathers or no feathers, it lays eggs or it does not. The only exception, the only predictive feature that is not Boolean, is legs, which is an integer and tells how many legs the animal has. We will discuss the engineering of features in one of the later lectures, but you can see here already that one important issue in the choice of features is that the feature set should be rich enough to cover all the seven categories of animals that we are interested in classifying. If we choose too few features, or choose the features with some bias, it may be that the feature set is perfect for classifying birds and useless for classifying mammals, and so on. If one could have hundreds and hundreds of features it would not be a problem, because we could cover everything; but as you will see later, it is often desirable to keep the feature set small, and then one has the issue of choosing the appropriate features. There are two features here that are a little special, because they characterize or classify the data item itself. The important one for the task is class type, which is always one of the seven types, so all data items are pre-classified and given a class type. But then you also have something called animal name, which essentially says that this animal is an ape, this is a horse, this is a dog, and so on. This dataset has only one instance, only one data item, with each name, so there are not several examples of dogs or several examples of cats; there is a single, well-defined characterization of each of these kinds of animals at that conceptual level. Here you finally see the whole dataset, all the 101 objects; it is kind of handy that one can show the whole dataset on one slide. As I said earlier, there is only one example of each animal, at a kind of basic level: one dog, one crab, one penguin, and no duplicates of animals at this level. The dataset, as given, focuses on a task where the analysis goes from the basic animal level upwards towards animal. As we will discuss a little more later, every domain has many levels: in zoology, a lot of people have spent years and years finding the appropriate classification, from the very lowest level to the highest, and it is not always relevant to discuss things on all levels. Normally you choose a level which is relevant for the problem at hand, and this is just an example of such a choice.
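If you want to inspect the dataset yourself, something like the following sketch is one way to load it; the file name and the exact column list here are my assumptions about the repository's format, so check the dataset documentation.

    # Loading the UCI Zoo dataset with pandas; file name and column names
    # are assumptions about the repository format.
    import pandas as pd

    columns = [
        "animal_name", "hair", "feathers", "eggs", "milk", "airborne",
        "aquatic", "predator", "toothed", "backbone", "breathes", "venomous",
        "fins", "legs", "tail", "domestic", "catsize", "class_type",
    ]
    zoo = pd.read_csv("zoo.data", header=None, names=columns)

    print(zoo.shape)                          # expected: (101, 18)
    print(zoo["class_type"].value_counts())   # distribution over the 7 categories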
So this is the end of this first lecture, and thanks so much for your attention. I hope you are starting to get a feeling for the whole area now; we will go into some more detail, and in the next lecture we will look a little more at the terminology here, in particular the terminology for objects, categories and features, to sort out what is what.

Machine Learning,ML
Prof. Carl Gustaf Jansson
Prof. Henrik Boström
Prof. Fredrik Kilander
Department of Computer Science and Engineering
KTH Royal Institute of Technology, Sweden

Lecture 7
Objects, Categories and Features
Welcome to the second lecture of the second week of this course on Machine Learning. This
lecture will focus on the characteristics of Objects, Categories and Features, essentially the
building blocks needed in the area of classification and concept formation. This lecture will
discuss the following items: first we will talk about Objects and Features, and then we will also
discuss something we call the Object Space and something we call the Object Language, and
furthermore we will discuss Categories and Category Structure, and what we call the Category
Space and what we call the Hypothesis Language. So the first group of items refers to the way
we phrase our data and the second group of items refers to how we phrase our abstractions.

So let's start by talking about Objects, and by Object I refer to the basic data items on which we base our analysis. As you see, there is a very complex jungle of terminology; people use very many words. "Object" I personally like as a term, but I think that in many cases Data, or maybe Data Item, is a good word because it is understood by many people. Some people talk about Record, some about Tuple, some about Row, some about Vector. People from statistics talk about Instance or Example, or Training Instance or Training Example. There are also some more exotic terms, like Thing or Entity. But essentially, all these terms are supposed to refer to the very concrete Data Items or Data Objects on which we base our analysis. Then the question is how we characterize an object, and here we come to the use of the word Feature. In the same fashion as there are many synonyms for Object, there are many synonyms for Feature. I highlighted Feature as I highlighted Object, but there are of course alternatives: you can talk about a Property of an Object, an Attribute of an Object, a Characteristic of an Object. You can talk about a Field, but a Field in a way goes together with Record. You can talk about Column, which in turn goes together with Vector or Row, and so on. And then we have words primarily used in statistics, such as Variable, and of Variable there are various variants: we can talk about an Output Variable, an Independent Variable, a Predictor Variable. We can also talk about variants of Feature: a Target Feature, a Category Feature. And of course we have special semantics here - a Predictor Variable and an Independent Variable are more or less the same thing, and they are separate from an Output Variable, because normally when we talk about an Output Variable we mean the kind of class label that relates to the abstraction we want to create. The same holds for Category Feature, so Output Variable and Category Feature can be more or less synonymous. Unfortunately, I do not think anybody can assume that people will be totally consistent. I must admit that I myself switch between words: sometimes it feels right to talk about a Data Item, sometimes it feels right to talk about an Object. But what matters is the basic understanding that there are two kinds of phenomena here: on one hand the concrete objects, with synonyms for that, and on the other hand the ways of characterizing those objects, with synonyms for that too. Finally, we can say that you need a formalism to express these things. It is nothing very advanced - just a simple, conventional formalism in which you express your Features, and some formalism for how to group Features so that they can form an Object. When we refer to this kind of simple language, in which Objects and Features can be expressed, we normally talk about the Object Language, or the Data Language, or the Observation Language.

So, just a few words about the types of Features. Of course there can be many types, but the common or normal types that we will see within this course are the following. First of all the Ordinal ones; you can call them Binary or Boolean. Then we have the Numerical ones, which can be either discrete or continuous. And finally we have the Symbolic Features, which are simple symbols, or we can have simple Structures - the most common there is of course Lists, but they can also be more complex, like Graphs, and so on. Here comes one example from the Zoo Dataset: one data item picked out, which happens to be a buffalo, and as you can see the Object Language here is a very simple one; it simply consists of listing the features in a row. You can see below that the features here are mostly Ordinal; only one is Discrete. Two features are special because they are not predictor features, in the sense of predicting the class; rather, they are classification features that pre-classify each object. There are two of them because, on one hand, the task is to classify these animals on a higher level, and the class type, with values from 1 to 7, is used for that. But then there is also the animal name, which is actually the name of that specific kind of animal, and because there are no duplicates here - whether it is a buffalo or some other animal - it is just the identity of that item.
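In code, such a simple object language amounts to nothing more than a mapping from feature names to values; here is a sketch of the buffalo item, where the particular feature values shown are my assumptions for illustration.

    # One data item in a simple object language: feature name -> value.
    # The values are assumed for illustration.
    buffalo = {
        "animal_name": "buffalo",     # identity, not a predictor feature
        "hair": 1, "feathers": 0, "eggs": 0, "milk": 1,  # ordinal (Boolean) features
        "legs": 4,                    # the one discrete numerical feature
        "class_type": 1,              # pre-assigned class label
    }

    # the predictor features are everything except the two classification features
    predictors = {k: v for k, v in buffalo.items()
                  if k not in ("animal_name", "class_type")}
    print(predictors)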

So let's talk now about the Object Space. By Object Space I essentially mean the space of all possible objects that can be described by the Object Language. Depending on the language, this space can be smaller or larger. But it should be separated from the set of objects that we actually consider. With this convention, the Dataset - the set of objects considered for evaluation - is normally just a subset of the Object Space. And depending on how the features are set up or chosen, this can be a dense or a sparse space, meaning that the subset may be a pretty small subset of the total Object Space. There are other words here as well: Sample is a word from statistics which also means the set of objects you actually look at - Training Sample, Statistical Sample. One can also use words related to Vector or Record: we can talk about Tables and Arrays. So if the data item is a row and the feature is a column, then the Dataset is the table, and an Array is of course a way of representing a Table. Training Example Set is another term that can be used for a small subset of objects that we consider.

So let's take a few examples from the Zoo Dataset, looking first at the Object Space again. The Object Space, or population, is the set of all potential feature vectors with feature values that can be expressed in the Zoo Object Language. So the Object Space, in the way I phrase it here, is syntactically defined by the object language. The sample, or dataset, in this example is the whole set of Zoo feature vectors as described on one of the earlier slides. A word that is a little tricky to use, because it is inherited from philosophy, is Extension, because in philosophy the extension of a concept normally means the set of all entities in the world; the extension of the concept Buffalo would mean the set of all buffaloes in real life. It is tricky to use because we have not necessarily chosen our representations in such a way that they truly represent all living creatures of a certain kind, and therefore I would avoid the word extension.

So now we come to what is called the Hypothesis Space, and to how it is defined from the definition of a category. If we look at the term category, there are many synonyms - really a lot. We can talk about categories, about concepts, about classes; we can talk about hypotheses, about target functions, about types, about schemas, about models; we can talk about a classifier; we can even talk, if we are a little more philosophical, about the intension of a concept. All these are words to capture the abstraction of the concept. The hypothesis language - the language for expressing the abstraction - can theoretically be any language; it could be different from the Object Language. But in many cases it is, very typically, syntax-wise consistent with the Object Language. Of course there have to be, even in the simplest cases, minor additions. For example, the syntax for feature values has to be extended with suitable generalizations of the normal values, because the hypothesis or concept in question is always an abstraction from concrete objects. If we have complete objects expressed in terms of features with concrete values, and we want to formulate an abstraction over them, we need some wildcards - some variables that allow us to generalize the feature values. Normally we also have available background knowledge that can constrain which kinds of abstractions are meaningful; this constitutes a priori knowledge, outside the knowledge that we get from our objects or our data. This background knowledge can normally be embedded in the hypothesis language through simple language constraints, and if we embed this knowledge in the language we normally call it a language bias. It is actually a learning bias, because we can only learn what the language allows us to learn. Finally, what is the hypothesis space? The hypothesis space is the space spanned by what can be expressed in the hypothesis language.

So let's talk a little more about terminology here. We have introduced the word Object and the word Category, and we have a definition of a Category expressed in a hypothesis language. In these terms, an object is an instance of a category, and a category is a generalization of an object. But then we also have the relations to the object space. Essentially, given a specific category definition, one can look at the subset of the object space consistent with that category definition - that is, all objects that fulfill the constraints of the category definition. This subspace is normally bigger than the actual subset of the dataset consistent with the category definition, because there may be a lot of objects that are not in the dataset at this point but are still expressible in the language we created. So of course there is a subset relation between the subset of the dataset and the total subset of the object space consistent with the category definition, and an object is an element of the subset of the dataset. This is just a little picture to show you the kinds of relations that exist within the terminological framework we have set up.

You may have found the discussion around the last slide a little abstract, so let's take a very simple example from the Zoo dataset. Here you find the subset of all fishes taken out of the dataset. The question is then: what is a meaningful generalization of all those data items? The simplest way of handling this is, first of all, to choose a hypothesis language which is close to the object language - which, as I have already stated, is the normal way of going forward here. Apart from following the same structure, what we have to do is introduce some generalization of feature values; in this case we only have ordinal values, so we only need some kind of wildcard, like a question mark. So you see in red an abstraction, or an abstract record, which in this case is the concept definition, where you find wildcards in the positions where the subset of the dataset includes complementary values. Going back to the earlier discussion on terminology: we have a subset of the dataset consistent with the concept definition. Theoretically we can then also think about the subset of the object space, which is slightly larger, because there are typically combinations of the wildcard values not covered by the actual subset of the dataset; but that is not explicitly depicted on this slide.
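The generalization step just described is easy to express in code: keep a feature value where all objects agree, and put a wildcard where they disagree. Below is a sketch; the shortened example vectors are invented stand-ins for the real fish records.

    # Generalize aligned feature vectors: keep agreed values, use "?" where
    # the objects carry complementary values. The example vectors are invented.
    def generalize(objects):
        result = []
        for values in zip(*objects):
            result.append(values[0] if len(set(values)) == 1 else "?")
        return result

    fish1 = [0, 0, 1, 0, 1, 0]
    fish2 = [0, 0, 1, 0, 1, 0]
    fish3 = [0, 0, 1, 0, 0, 0]    # disagrees in one position

    print(generalize([fish1, fish2, fish3]))   # -> [0, 0, 1, 0, '?', 0]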

So now we turn to something different. Apart from just looking at objects and categories, we can also look at category structures, which is very typical in any domain analysis. This means that we do not just have a single level with one category, but that we characterize the objects in a domain in some structure. Such conceptual structures, or category structures, also have many names: class structure, class hierarchy, class lattice, concept structure, concept hierarchy, type structure or taxonomy. And it is of course possible in machine learning not only to learn one level of categories, but also to learn multiple levels of abstraction, which means it is possible to learn category structures. Apart from persistent and domain-relevant category structures that really make sense in the domain, it is also possible to work with temporary structures as part of the learning process; we will come back to that, but an example of such a thing is version spaces.

Obviously, when we add categories on several levels we can generalize, so we can define specialization and generalization relations between all these levels.

Finally, on this slide and the next, I am just going to illustrate for you how it can look in one case. The most common case is when one looks at concept hierarchies, which are tree structures; we abstract from objects on several levels in a hierarchical way. The other variant is what we call a concept lattice, which is a very similar thing. The only difference is that in a lattice it is allowed to have multiple generalizations upwards, which means that you can generalize from category six to category three or category two in parallel. Many times a lattice maps better onto the real situation, but technically, in most cases, hierarchies are easier to handle.

To convince you that multiple levels of abstraction - multiple levels of categories or concepts - is not only a theoretical thing, I show you here an extension of the Zoo example, where I have simply followed one path from one kind of item in our dataset: the buffalo. What you see are all the established zoological super- and sub-categories, and as you can count, there are thirteen or fourteen levels from buffalo up to animal. Of course, this is a domain where zoologists have worked for hundreds of years in the effort to find an optimal taxonomy for the area, so it is not likely that in newer or more artificial domains you will find this depth. But I take it to show you one extreme case, illustrating that many levels of abstraction is not just a theoretical thing but occurs in practice. Another observation along this line, here in parenthesis: if you remember the features that were used in this dataset, and compare them to the features that you can infer to be important from this kind of categorization, you will see that they are not identical. Not surprisingly so, because what is shown on this slide is the line from the buffalo to the top, and if we only want to classify animals on that line, starting from buffalo and going up to animal, then a certain set of features would be optimal. In the general case, for our dataset, we want not only to classify mammals; we want to classify fishes and insects and so on. This means that for that general scenario we need a broader range of features than the set that is optimal only for going down the buffalo line, so to say. Everything here, of course, depends on the purpose.

Finally, I wanted to comment on what happens to features in this kind of more abstract category structure. When a conceptual structure is formed during the learning process, features will be attributed to categories of different generality. In the simplest case we only have one general category, so it is not an issue. But if we consider many levels, the normal way of looking at this is that every level of abstraction contributes a certain specification of certain features. The common way of looking at it is also that, when you consider the features of an object in a certain sub-tree of a conceptual hierarchy, categories further down the line inherit the features of categories higher up the line. Furthermore, features are normally not spread evenly across the abstraction levels: most features are grouped in the mid-range of the conceptual structure, and this kind of mid-range is termed the basic level. Of course it all depends on what you want to classify and how you arrange your structures, but depending on how you do it, it is always possible to say that some level is basic. A very simple example you can see here: if we look at various kinds of fruit trees, in the middle we have trees like apple trees, peach trees, pear trees and so on; more abstract is fruit tree; more specific are specific kinds of apple trees, like McIntosh trees, Delicious trees and so on. It then turns out that if you think about what kind of characteristics you can attribute to each level, you will normally find that it is more natural and easier to attribute characteristics to what is here termed the basic level - again with the disclaimer that the basic level is something relative to the way you set up the category structure.

So, we have reached the end of this lecture. Thanks again for your attention. The next lecture will be on the topic "Feature Related Issues". Bye and thank you.

Machine Learning,ML
Prof. Carl Gustaf Jansson
Prof. Henrik Boström
Prof. Fredrik Kilander
Department of Computer Science and Engineering
KTH Royal Institute of Technology, Sweden

Lecture 8
Feature related issues

Welcome to this third lecture of the second week of this course on machine learning. This lecture will focus on features and on issues related to features. As you may already have understood from the earlier lectures, feature engineering - that is, the way to decide on the appropriate set of features to characterize the objects within the dataset - is a crucial issue for machine learning. The idea here is to try to characterize the concept learning task in terms of the relevant features, feature vectors and what one can call the object or feature space. The classical way of viewing a scenario for a learning task is to define an appropriate set of features, view each data item in the dataset as a feature vector, consider the feature or object space spanned by the features, populate the feature space with the feature vectors or data items, and find optimal multi-dimensional surfaces - hyperplanes - in the object space that circumscribe the extensions of all the concepts involved. The engineering of features is crucial for the complexity of the object space and, as a consequence, also crucial for the complexity of the learning problem. We want to distinguish three basic cases for feature engineering. In Case 1 we have a reasonably well-composed set of features given, based on domain-theoretic considerations. In Case 2 we have a huge set of possible features available, but these features have to somehow be reduced to a manageable size that can ensure an efficient learning process. The third case is that we have data items of a non-digital nature - it could be images, it could be sound, it could be other forms of representation - and in that case the relevant features need to be extracted from the primary form of the data items as a separate process; typically this needs to be done in a case-to-case fashion, depending on the nature of the primary form and the character of the application. We will now discuss the first and third cases shortly, and then more or less focus on the second case for the rest of this lecture.

So now I am going to discuss an example to illustrate the character of the first case, and this example is fetched from the systematic example introduced this week, the Zoo dataset. The problem in this case is neither a volume problem due to a large, ungraspable set of possible features, because we are talking about a limited set of features, nor a representation problem caused by data items in non-digital form, because we assume that we can define the features and have digital values for them. So then the question is: what are the problems here? If you look at the two columns in this example, you can see in the first column the original features of the dataset, as fetched from the standard repository. In the second column you can see a number of features that are inferred from the conventional zoological taxonomy used to characterize or classify animals; the set of features indicated in the second column is derived more or less exactly from the complex taxonomy ranging from animal down to a specific kind of buffalo. As you may see, in some cases there is a correspondence - the same features occur - and in other cases you can see that there are differences. Obviously there is a reason for the differences, because the second column reflects the taxonomy for one specific line in the animal taxonomy, leading down to buffaloes and related animals, and the optimal set of features for that line is not necessarily optimal for all kinds of animals, like fish and reptiles and birds and so on. But it is not a trivial thing to say whether the left column is the optimal one for the problem at hand, or whether something like the right column may be more correct. The bottom line here is that for this kind of case, where we have a reasonable number of features for a reasonably well-known domain, we still need a domain-based sanity check of the features. It is also very important that we have terminological consistency - that we use the same term for the same thing. It cannot be the case that we use one term in one of the columns and another term in the other, and then misjudge whether they are the same or different. The same characteristic needs to be referred to in the same way in any competing proposal for the correct feature set; this is more or less common sense. It is also very important that there is a clear feature definition for every candidate feature. We have a splendid example of that: at the bottom of the first column we have something like "catsize", whose meaning is absolutely not self-evident. Looking at the dataset, catsize actually means that this kind of animal has the size of a cat or larger - it is a Boolean feature with the value one if that kind of animal is considered to be equal to or larger than a cat in size. This is just an example showing that merely writing a symbol like catsize does not make it evident what it means. So the homework in any feature engineering task is, first of all, to see to it that the terminology is crystal clear - that the same term is used for the same kind of feature - and also that every feature is well defined. That is a starting point; there are of course many other issues to consider, but at this point I just want to mention these few.
So the third case is the case where the data items of our dataset are of a heterogeneous and non-digital nature, and we need a separate pre-processing step to go from the initial form and map that form onto relevant digital features. This can be done in several ways. Just look at the example here, with four images. At one extreme you can have a fully manual process, where a person looks at each image and infers a feature set for that image: for the first image the inference is that we are talking about birds, in the second image it is fishes, in the third mammals and in the fourth insects; going further, there are so many birds of a certain size, so many birds of another size, and for the fishes we can do the same. So we can do a manual analysis of each image and, from that image, decide upon a description of the image in digital form, in terms of manually produced features. The other extreme is a totally automated process, where a computer-vision-enabled program manages these images and automatically infers the most likely set of important features characterizing each image. And there can be things in between - processes that are automated but still need some human intervention. In the end, of course, we want a description of these images in something equivalent to the kind of feature set-up we looked at for case number one. For sure, every non-digital form of representation demands its own analysis here; we cannot believe that we can use the same techniques for every form. For images we need certain techniques, for sound we need other techniques, and so on, in order to enable some degree of automation for this kind of case.
reduction or Feature Reduction so in this case we have a large set of possible features and we
want to reduce them to a manageable size, so in most realistic cases the amount of possibly
available features can be used to characterize data items is overwhelmingly large, in general
we want to reduce the number of considered features the ground for removing the specific
feature is, that it may be either redundant or an irrelevant and if so can be removed without
causing loss of information the goal is obtained is to obtain an adequate set of informative
relevant and non-redundant features still being able to describe the available dataset. The
underlying motivations for dimensionality reduction can be summarized as follows: At 1
making models easier to interpret by humans, the smaller set of features is more easy to grasp
for a human when looking at them all.
Avoiding the curse of dimensionality and we will come back to what that could mean. Third
reducing the risk of overfitting and we will also go further into that and finally in general
shorting the computation times for learning processes. Naively you can say that the more
features we have to consider, the more costly the computation will be, so the term Curse of
dimensionality refers to various phenomena that arise when analyzing and organizing data in
high dimensional spaces, we talked about hundreds or thousands of dimensions problems that
do not occur in low dimensional settings such as the three dimensional physical space and
others. The term was coined initially by Richard E Bellman when he worked with problems
in Dynamic Optimization. The common theme of the problematic phenomena is that when
dimensionality increases the volume of the space increases so fast that available data perhaps
sparse. This sparsity is problematic for any method that requires statistical significance, as the
amount of data needed to support a result often grows exponentially with the dimensionality.
Also organizing and searching data often relies on detecting areas where objects from groups
with similar properties unite, however in high dimensional cases all objects appear to be
spares and is similar in many ways, which prevents efficient data organizations. So now we
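To give a feel for this sparsity effect, here is a small numpy sketch (synthetic random data, not from the lecture) showing how the contrast between the nearest and the farthest neighbour shrinks as the number of dimensions grows:

```python
# A sketch of one curse-of-dimensionality phenomenon: in high dimensions,
# random points become almost equally far apart, which undermines
# similarity-based organization of data.
import numpy as np

rng = np.random.default_rng(0)
n = 100
for d in (2, 10, 100, 1000):
    points = rng.random((n, d))                       # n random points in the unit cube
    sq = (points ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * points @ points.T  # squared pairwise distances
    d2 = np.maximum(d2, 0)                            # guard against rounding noise
    dists = np.sqrt(d2[np.triu_indices(n, k=1)])      # distinct pairs only
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative distance contrast = {contrast:.2f}")
# The contrast shrinks as d grows: nearest and farthest neighbours look alike.
```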
So now we turn to what is termed overfitting versus underfitting, the two phenomena most frequently caused by a wrong or inadequate selection of features. Overfitting is the production of a model that corresponds too closely or too exactly to a particular data set, and may therefore fail to fit additional data or predict future observations reliably. Typically, an overfitted model is a model that contains more features than can be justified by the data set, and through the existence of all these features the current set of data is fitted too exactly. In contrast, underfitting occurs when a set of features cannot adequately capture the available data set; typically, an underfitted model is a model where some features that would appear in a correctly specified model are missing, and as in the overfitted case, such a model will also tend to have poor predictive performance. In the example above you can see two-dimensional illustrations of how this can look, but as you understand, as for many other examples given, the same phenomena can occur in multi-dimensional cases.
So finally we turn to the two concepts of feature selection and feature extraction. The concept of feature selection is pretty straightforward: it is the process of selecting a subset of relevant features from the original set and discarding the rest. The three main criteria for the selection of a feature are: one, how informative is the feature; two, how relevant is the feature; and three, is the feature non-redundant. Relevance and redundancy could be thought of as equivalent, but they are not, because it may be that two relevant features are so similar or overlapping that one of them can be considered redundant. In general, what we want to achieve is to have as few features as possible, as long as we can still discriminate in a good way among the data items in the data set and the categories in question. Feature extraction is a little more complex: it is not the process of selecting anything or throwing something away, it is rather the process of deriving new features, either as a simple combination of the original ones or as a more complex mapping from the original set to a new set. We started here by talking about always reducing the number of features, but actually it could be that, without really reducing the number, one creates a set of more suitable features that simplifies the learning task. Feature extraction is in that way more general, because it captures all kinds of mappings from one original set of features to a new one, given of course that the new set is more useful for the learning task. So in both cases the learning task is supposed to be more tractable in the resulting feature space than in the original one, independently of whether you use a selection procedure or some extraction process.
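The following toy sketch (invented numbers; the PCA-style projection is computed with plain numpy) contrasts the two operations:

```python
# Selection keeps a subset of the original columns; extraction derives new ones.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
X[:, 3] = X[:, 0] + 0.01 * rng.normal(size=100)   # feature 3: redundant copy of 0
X[:, 4] = 0.001 * rng.normal(size=100)            # feature 4: near-constant noise

# Feature selection: discard the redundant and the uninformative columns.
keep = [0, 1, 2]                                  # e.g. after variance/correlation checks
X_selected = X[:, keep]

# Feature extraction: map all original features to 2 derived features
# (principal components via a singular value decomposition).
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_extracted = Xc @ Vt[:2].T

print(X_selected.shape, X_extracted.shape)        # (100, 3) (100, 2)
```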
So this is the end of this lecture. I want to summarize by saying that I hope you have now understood that feature engineering is a key ingredient in the area of machine learning, and that all the cases mentioned in this lecture are relevant: the case where you have non-digital data items that have to be transformed into some digital form with a discrete set of features; the case where you already have a reasonable number of features but where they have to be judged in terms of domain relevance and in the light of a domain theory; and thirdly, the case where you have a huge set of discrete features that have to be reduced in order to be more optimal for the performance of the learning algorithms. By this I want to thank you for your attention. The next lecture will be on the topic of scenarios for concept learning. So thanks and good bye.

Machine Learning,ML
Prof. Carl Gustaf Jansson
Prof. Henrik Boström
Prof. Fredrik Kilander
Department of Computer Science and Engineering
KTH Royal Institute of Technology, Sweden

Lecture 9
Scenarios for Concept Learning

So welcome to this fourth lecture of the second week of the course in machine learning. This week, as you know, we try to characterize learning problems, and in this lecture we are going to look at different scenarios for how learning can take place. I will start with two basic distinctions. The first very important distinction in machine learning is between supervised and unsupervised learning.

In supervised learning, all input data are pre-classified by the use of unique concept labels; it may be that we only try to learn from data items of one class, but there could also be multiple classes. The goal of supervised learning is, based on these pairs of input data and labels, to learn a concept definition that best approximates the relationship between them. An optimal scenario will allow the algorithm to correctly determine the concept label for a new, unseen data item. In contrast, in unsupervised learning no data are classified; this means that the data set contains only what in an earlier lecture were called predictor features or predictor variables, and there are no concept labels. Unsupervised learning algorithms therefore have to identify commonalities among the data items and find structures in the data set, and then, with that as a base, try to group the data items based on some kind of similarity measure. Unsupervised learning algorithms have to decide on an optimal portfolio of concepts (there can be many variants) that best matches the data set at hand, and then arrange groupings of subsets of the data set so that those groupings match the portfolio of concepts. These are the two main situations. Classically, a lot of work has been done, and still goes on, in supervised learning, where we have a lot of data, the data are labelled, and the only problem is to find reasonable abstractions that capture the character of the data. Of course, when we are going to use machine learning in a wider range of settings, for example in self-driving cars, we cannot rely on everything being classified all the time, because we will have continuous data and so on. So unsupervised learning becomes much more important over time; one should realize, of course, that the unsupervised learning problem is a much more difficult problem than the supervised one.
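As a schematic contrast, the following sketch (assuming scikit-learn is installed; the data are random placeholders, not the lecture's examples) shows the two settings side by side:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 2))                 # predictor features only
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # concept labels, known in advance

# Supervised: the learner sees (input, label) pairs and approximates the mapping.
clf = DecisionTreeClassifier().fit(X, y)
print("predicted label for a new item:", clf.predict([[0.5, 0.5]])[0])

# Unsupervised: no labels; the learner must find groupings on its own.
groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("items per discovered group:", np.bincount(groups))
```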
Let us turn to the second distinction, which is between what we call offline (batch) learning and online or incremental learning. Classically, most of the work in machine learning has been done offline, with the use of static batches of data fully available at the start of the analysis, but increasingly we are interested in looking at continuous flows of data items over time and being able to analyze those data as they arrive. Of course there is some middle ground here; the two cases just described are extremes. We could also consider looking at smaller batches of data, where we still have a flow of data but partition that flow and handle each mini-batch at a time. So all kinds of such variants are possible. The distinction between offline and online is relevant both for supervised and unsupervised learning, so one can say that these two distinctions are orthogonal to each other. Concerning offline learning, we are talking about the situation where the system is not operating in a real-time environment but handles pre-harvested data in static and complete batch form. Most traditional machine learning algorithms are well adapted to offline learning, and the parallel access to the whole data set gives full flexibility in using the data items in all kinds of variations to optimize the learning process. In contrast, online learning is a learning scenario where data are processed in real time, in a fully incremental fashion: input data are incrementally, gradually and continuously used to inductively extend the existing model. Results of earlier learning phases are typically maintained and regarded as still being valid. Incremental algorithms are frequently applied to what we call data streams or big data, and we see more and more of that; examples are stock trend prediction and user profiling over such data streams. Many traditional machine learning algorithms inherently support incremental learning, but may have to be adapted to facilitate this in practice. To sum up the two distinctions: traditionally, most machine learning was offline learning of a supervised character, while we are now moving into a time where we increasingly need techniques to handle the online, unsupervised case.
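To make the contrast tangible, here is a minimal sketch (illustrative numbers) of the same quantity computed in batch mode and in a fully incremental mode:

```python
# Estimating the mean of a data stream: a very simple "model".
import numpy as np

stream = np.random.default_rng(4).normal(loc=5.0, size=1000)

# Offline (batch): the complete data set is available at once.
batch_mean = stream.mean()

# Online (incremental): items arrive one at a time; the earlier result is
# maintained and extended, and no raw history needs to be stored.
n, online_mean = 0, 0.0
for x in stream:
    n += 1
    online_mean += (x - online_mean) / n   # incremental update
print(batch_mean, online_mean)             # identical up to rounding
```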
We are now going to go through a number of scenarios of learning, of the supervised and unsupervised kind, and of the online and offline kind. We will start from the very simplest situation, which we describe as learning a single concept offline from pre-classified positive examples. The picture you see is a way to depict this situation, where you have objects of the same kind, all labelled in a consistent way, and the task at hand is to formulate a description of a concept that in an optimal way covers the data items considered. There are no other elements around in this case. Now we turn to scenario two, and actually scenario two is more or less scenario one; the only thing we introduce is something very important but also very problematic.

We look at the first case, but we acknowledge the possibility of the presence of noise. Noise is a fundamental underlying phenomenon that is present in all data sets; it is a distortion in the data that is unwanted by the perceiver of the data. Noise is anything that is spurious and extraneous to the true data, and typically the noise is due to a faulty capturing process. As you see from the picture, this may result in objects that should clearly be inside the concept definition ending up outside because they are noisy, but there may also be erroneous data elements that are clearly within the frame and should not be there. Of course, anything can happen with noise, and any algorithm we encounter needs to defend itself against noise; that means there always have to be mechanisms to fight noise. There can never be any hundred percent guarantee, but at least there must be defense mechanisms. And as you should understand, noise can occur in all the scenarios to come in the same fashion as in this first very simple one. We now turn to scenario three, and actually scenario three is related to the last scenario, where we talked about noise.
Here we will talk about outliers. An outlier is a data item that is distant from the other observations. An outlier may be due to natural but extreme variation, in which case it may be quite okay, or it may indicate an experimental error or other noise. Handling outliers is a challenge, and it is of course crucial to distinguish between the measurement-error cases and the cases where the population has a heavy-tailed, skewed distribution. As you understand, when we are informed that we have the latter, natural case, the concept definitions normally become much more complex, in order to see to it that the definition also covers these extreme outliers. The same comment applies here as for noise: in any more complex scenario to be discussed, we can always talk about outliers and ways to handle them.
For scenario four we stick to the learning of a single concept, but what we will discuss here is the role of negative examples. In the first scenario we only considered positive examples, but for many situations and many algorithms we may achieve faster convergence towards the appropriate concept definition if we feed the algorithms with a mix of not only positive but also negative examples. Negative examples can be made available in a variety of ways. As we will see later, when we look at learning of multiple concepts, we will of course have classified examples of the different classes available in parallel, so we can use the examples of the other classes as negative examples for one of them. It is also possible, if some domain knowledge is available that gives information on the character of examples, to artificially generate negative examples and feed those into the learning algorithm.
So let's talk about what we call scenario five. Scenario five is actually an interesting variant of the use of negative examples, which we refer to as near misses. The question is, when we are in the situation where we want to use negative examples to constrain the learning process and make it more efficient, it is not obvious what type of negative examples we should use, because there may be many alternatives, and negative examples may be very far from and differ considerably from the positive examples. That does not help us much, because there is still too much flexibility for the algorithm to determine the classification boundary in the light of negative examples that are very, very far from that boundary. The idea here is to use negative examples that differ from the concept being learned in only a small number of significant ways. As depicted in the small picture, these negative examples kind of cling to the rim of the concept, and such very close negative examples are what we refer to as near misses. As I said on the earlier slide about negative examples, there are various ways of finding these: they may belong to the neighbouring concepts, but they could also sometimes be artificially generated.
In scenario six we turn to another aspect of the data set. We still talk about the same basic kind of learning situation, but in the earlier scenarios we have assumed no internal structure of the data set, which means that all data items have been regarded as having the same status and importance, and no structure or metric has been assumed among the data items in the set. One consequence of this is that, because they are all the same, they can be handled in any order: it does not matter in which order we treat the data in our analyses, because they are all first-class citizens and considered equal. In contrast to that, we will now look into the situation where we do not see them as equal. First of all, we can say that some of the objects are better or more typical than the others: they are better representatives of the concepts, while others are less good members in the sense of typicality. Naively, the more typical objects would then be more advantageous to start with in the learning process, because if you start with those, you get a kernel of a concept definition that is more stable than otherwise. In this situation we assume not only that objects are more or less typical, but also that there is some internal structure among the data, so that we can talk about some measure of similarity between the various pairs of objects. If we have such a similarity measure, it also means that we can let the relations between the objects guide the order in which we consider the examples. Essentially, the idea would be to optimize the learning behaviour through the enhanced knowledge of the internal structure of the data set.
Let's turn to scenario seven. The sixth scenario, the last one, implied a well-defined structure and similarity metric for the data set, with some typicality of objects, and this kind of situation opens up a new scenario, in which we may no longer need an explicit generalization or concept definition to be formed. By instance-based learning, or memory-based learning, we mean learning algorithms that, instead of creating explicit generalizations as the basis for making a prognosis on incoming cases, simply compare new problem instances with all the instances we already have stored in our memory, though of course in a structured way. In principle, the negative side of this is that when we get a new case, instead of matching it against some abstract, general concept definition, we theoretically have to compare it with all the stored objects, and this is computationally hard. However, there are also advantages, because an instance-based model is typically much easier to adapt to new data. In this way one can say that the model is more agile, because it can change much faster as new data is handled. Such a model typically stores each new instance and introduces it into the existing structure, and in some cases, after some consideration, one may also decide to throw away old instances because they may have become obsolete.
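As a minimal sketch of this style of learning (with invented instances), a one-nearest-neighbour prognosis looks as follows:

```python
# Instance-based learning: no explicit generalization is formed; each new case
# is compared against the stored instances.
import numpy as np

memory_X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8]])  # stored instances
memory_y = np.array([0, 0, 1, 1])                                      # their labels

def predict(x):
    """1-nearest-neighbour prognosis: label of the most similar stored instance."""
    distances = np.linalg.norm(memory_X - x, axis=1)
    return memory_y[np.argmin(distances)]

print(predict(np.array([1.1, 1.1])))  # 0
# Adapting the model to new data is trivial: just append the instance to memory.
```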
In scenario eight we still keep to supervised learning, but leave the offline case and go into the online case. Online learning is a learning scenario where data are processed in real time, in an incremental fashion: input data are incrementally, gradually and continuously used to inductively extend the existing model. We still have one concept definition that we form, and the picture here is maybe a little clumsy, but you should interpret it so that the sum of all the green circles is the data set so far. We harvest our data items in a time-wise fashion and treat the data items of the data set as they arrive; but, at least theoretically, no item counts for less because it comes earlier or later, they are in a way equivalent in that sense. When we treat new data items, the results of earlier learning are maintained. We can of course, as was discussed in some of the earlier scenarios, decide to discard data items for various reasons, but not necessarily because of where they come in the flow of input.
The next scenario, number nine, focuses exactly on this problem of what to do with data items that do not seem relevant anymore. In the continuous case we can really have more extreme situations, where first of all the amount of data is gigantic; this means that there are technical difficulties in storing and handling all the historical data. It can also be that this goes on over a very long time, which means that things that happened years ago may not be relevant anymore. In these extreme cases, with a lot of data and a very long time span, it may clearly be that you want to throw things away. And of course, when you throw things away you may also need to revise the concept definition, because the concept definition you have is based on all the analysis done so far; even if you analyzed some things years ago, they are still part of the definition, but if you really think those old data are not relevant anymore, then you have to revise the concept definition as well, not only throw away the data. Here you can see, if you remember some of the earlier scenarios like instance-based learning, that if you do not have an explicit concept definition, this is of course not a problem; in that case you only have to throw the instances away. That is one of the positive things with instance-based learning if you really want to systematically tackle the situation with non-relevant data. So, any algorithm that seriously addresses the online case must have some options for the discarding of objects and also for the revision of the concept definition.
Scenario ten now moves from learning single concepts to multiple concepts, and actually this scenario does not introduce any big differences or surprises, because as long as data are labelled, we in a way only multiply the problem and can handle it in parallel using the same kinds of approaches. So all aspects introduced in scenarios one to nine are still relevant: we still have to handle noise, we still have to consider outliers, we can consider negative examples for all the classes, we can look at near misses, internal structure is an issue, instance-based learning remains an option, online learning can of course take place here too, and the question of whether we should keep everything or discard or modify during the process is also still there. So let's say that scenario ten is the complete supervised scenario, with all its ingredients.
Now we switch and move from the supervised scenario to the unsupervised scenario, and look into the case where the input data are not classified; that means the input data only contain predictor features and no classification labels. Unsupervised learning algorithms therefore have to identify commonalities and structures in the data set and group the input based on similarity. The main category of techniques that tackles the unsupervised case is called cluster analysis. It may not be obvious to you, but when we go into this realm, it becomes much more important to consider the issues of internal structure of the data set, as brought up in some of the earlier scenarios, because if you do not really look into the structure of the data and the similarity metrics among the data items, it will be very difficult to handle this unsupervised situation. So this kind of scenario of course inherits most of the aspects of the earlier scenarios, but it is particularly important here to consider the aspects of internal structure, typicality, similarity metrics and so on of the data items of the data set.
Even though we will get back to these topics in coming weeks, I have added an extra slide here on cluster analysis, which is one of the primary techniques for handling the unsupervised case. Cluster analysis is the assignment of a set of observations into subsets called clusters: when we start, everything is unsorted, and we want to see to it that the observations, the data items, within each subset are more similar to each other than to objects in other clusters, while observations drawn from different clusters are more dissimilar. So a similarity metric is a very important thing: we have to have the possibility to measure the distances between data items. But then we also start to talk about a lot of topological things; it becomes very interesting to talk about the compactness or density of clusters, the degree of separation between them, and so on. This is a whole new realm of its own. I have put the next slide here just to point at an important, though maybe trivial, fact: when we now start to look at potential groupings of unsorted data, and look at clustering where we are trying to find optimal categories, what was said in an earlier lecture about categories and their structures is extremely relevant here, and any algorithm in the realm of unsupervised learning and clustering also needs potential mechanisms to handle not only a flat category structure but also hierarchical structures.
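To give a flavour of the core assignment-and-update idea, here is a bare-bones k-means sketch in plain numpy on synthetic two-cluster data (a real analysis would rely on a library implementation):

```python
import numpy as np

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])

k = 2
centers = X[rng.choice(len(X), k, replace=False)]   # initial centers: random items
for _ in range(20):
    # assignment step: each item joins the cluster with the nearest center
    labels = np.argmin(np.linalg.norm(X[:, None] - centers[None, :], axis=2), axis=1)
    # update step: each center moves to the mean of its assigned items
    # (a robust implementation would also handle clusters that become empty)
    centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])

print("cluster sizes:", np.bincount(labels))
print("centers:\n", centers)
```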
At the end of this lecture I take the liberty to rely a little on learning by repetition. In an earlier lecture I gave you more or less this slide about the end-to-end process for concept learning, and I think it is very healthy to look at it now and then, because for most of this lecture we have really focused on the core data analysis phase, which is of course the core of machine learning. But we should never forget that the scenario we have for the analysis phase is very much dependent on all the earlier phases: on the kind of data we can acquire, on how we can manage this data, on what theory we can establish, on how we can engineer the features of the data items, and so on. Putting everything we do in this context, and having it always in the back of our minds, is very important, because if we focus too much just on the core data analysis phase, it is very difficult to keep this big picture. So this is the end of the lecture; thanks for your attention. There now remains only one lecture for this week, the rather short tutorial for the assignments of the week. Thank you and good bye.

Machine Learning,ML
Prof. Carl Gustaf Jansson
Prof. Henrik Boström
Prof. Fredrik Kilander
Department of Computer Science and Engineering
KTH Royal Institute of Technology, Sweden

Lecture 10
Tutorial for week02

So welcome to the fifth and last lecture of the second week of this course in machine learning. As usual, the last lecture of every week is more or less an introduction to the assignments that I have assembled for that week. I have two comments at this point. The first comment regards the character of the assignments. There are two objectives with the assignments: apart from the first, being a vehicle for assessment, my intention is that the assignments should be a way for me to re-emphasize a few of the items that I touched on during the lectures of the week. This goes more or less for both groups of questions: as you have already seen, the first group consists of questions that more or less relate to material directly given in the lectures, while the second group of questions relates to material provided by some of the suggested readings. The second comment I want to give at this point is that I want to encourage you to also look wider with respect to a number of the items or concepts that I touch on. It is so easy today to Google anything; sometimes you can think that whatever you say, people will Google it, and I really want to encourage you to do so, because what I try to do with the course is to give you an overview. I hope I will be able to systematize things in this area in a good way for you, but of course there is an endless amount of material that you can find on any of the key topics that I have touched on, and it is probably very advantageous to look around a little and extend your readings on any of the subjects and subtopics that I have touched on. So please do.
Let's turn to the questions then; as always there are two groups, one with questions based on the video lectures, and group two with questions based on some extra material. Let's now immediately turn to group one. The questions themselves I will let stand for themselves; what I will do at this point is take the chance to mention or highlight the subtopics that I treated in the lectures and just make a short comment on them. The first question relates to the discussion I had, more or less in lecture two, about the importance of having a clear understanding of what an object is, what a feature is, what establishes an object space, what establishes a hypothesis space, and what characteristics these kinds of spaces have, which is a central topic for this week. The next question, number two, relates to lecture three, the lecture on feature engineering, which is of course important; obviously I think so, because I have given it a separate lecture, so this is a question that highlights feature engineering. There is actually another question, question four, that also partially relates to the selection of features: as you may remember, feature extraction or selection, or anything related to features in learning, has a really big impact on the forthcoming learning process, where there is the risk of phenomena like overfitting. As you may also remember from the earlier lectures, I made a distinction between regression and classification, and we have actually been dwelling mostly on classification, so with question three I want to highlight again the importance of the concept of regression. Finally, question five is there to put some limelight on the discussions I had, primarily during lecture two, on the conceptual hierarchy, by which is meant looking at concepts at various abstraction levels.
meant looking at concepts on various abstraction levels. So let's turn now to questions in
group 2 and actually the questionnaire I given you relates very closely to the five different
articles or papers that I recommended for further reading, and so actually the reason for
including reference one is that there are many traditional techniques in applied mathematics
that has a long history but are still very valid and very useful in machine learning and actually
dynamic program is one of those techniques, it's a pretty broad concept but is as a wide range
of applicability. And I want to include that to highlight the importance of this technique but
also in general the importance of bringing forward another traditional mathematical looking
into machine learning. So the second reference here is more related to the discussions I've
had on conceptual structures, and in particularly conceptual hierarchies and how one looks at
those and how one can distribute features in different ways in those conceptual hierarchies.
The third reference is I would say an Outlier if you remember that concept so it's an Outlier
on this week because we haven't really talked much about it but obviously there is a need in
all areas not much in learning is not exception for more theoretical work for more theoretical
frameworks for understand what it actually means to learn in various contexts. And there are
some work of that and this is a reference to one of the more well-known contributions to that
area. And the fourth reference relates to the item brought up in lecture four, so actually it's
there to highlight that in practical work and these concepts as was touched in lecture four our
often used and referred to so this is a work about Near Misses and how Near misses can be
used in a learning process. Finally the last reference and is a reference to a specific technique
actually support vector machine and it's a technique that we will come back to, yeah it's put
here because it's a representative for one of the techniques or issues brought up in lecture 4
that is instance based learning. So let's please see that as an example of instance based
learning on simulated based learning which was one of the topics of lecture 4. So this was the
end of the last lecture so thank you very much for your attention, the next week of the course
will have the following theme forms of representation. So this means that for the coming
week we will leave for a little while the central issues of learning and look more in detail in
the various forms of Representation used traditionally in artificial intelligence of course we
will I will relate here and there all the time to the use of these Representations in the machine
learning situations so thank you for this week and good bye

Machine Learning,ML
Prof. Carl Gustaf Jansson
Prof. Henrik Boström
Prof. Fredrik Kilander
Department of Computer Science and Engineering
KTH Royal Institute of Technology, Sweden

Lecture 11
Forms of Representation

So welcome to the first lecture of the third week of this course in machine learning. This week we will focus on representations. First a few words about representation: this is a course in machine learning, so you may wonder why we should spend a whole week on representation. I want to show you this triangle of three abstract concepts that are really the key, the backbone concepts, for any kind of work in machine learning. We have the concept of representation, and we have the concept of problem solving; it is a little a matter of taste what term you like, I prefer problem solving, some people would say reasoning, other people would say decision making, but it does not really matter. The representation part as such is the core declarative part of the system, but the representation is only there because it should support problem solving. This means that when we talk about a representation, it is absolutely necessary to also talk about the kind of problem solving that the representation supports. So you will see during the week that when we discuss the so-called representation schemes, the discussion will in the end be a mix of purely declarative aspects and more problem-solving aspects. Also, because this is a machine learning course, it is an important issue what learning would mean, what form machine learning would take, in each specific representation, and this means that in several of the lectures this week we will also touch on the learning aspects of these representations.
I also want to comment on something that I personally find very important. The message here is that computer science is of course an engineering discipline, and artificial intelligence is therefore an engineering discipline as well. The choices of representation and problem-solving schemes are, and should be, based on many pragmatic engineering decisions, and the same domain and problem can theoretically be modelled in any of the alternative forms of representation and problem solving; it is not necessarily as convenient to do it in all these forms, but it could be done. Also, when we work with a specific domain, we may use not only one of these schemes, we may use several of them, and we may map what we described in one scheme onto another scheme. There are many purposes here: one purpose is to have a kind of representation that is easy to understand for humans, for the people involved in the whole process of working with the domain; the other aspect is that the representation should be a convenient base for an efficient implementation. It may be, for example, as you will see later, that we have a neural network representation which is then mapped totally onto an array representation, and then the computation will happen in the array representation. So problem solving and machine learning for a domain may be more feasible to implement in some of these schemes, while some other forms may be useful for other purposes. The bottom line, and the summary of this slide, is that nothing is absolute here, everything is relative: representations and problem-solving schemes are tools in our toolbox, and we have to pragmatically choose the most suitable combination for a particular problem area. Of course, one always has to make a choice about the most relevant subthemes to focus on, and I have made the decision to spend most of this week on five kinds of representation and problem-solving schemes. The five I have chosen are decision trees, Bayesian networks, neural networks, genetic algorithms and logic programming.

So there will be one lecture on each of these themes. For each of these lectures I have decided upon a structure that I will try to stick to: for each of these representation schemes I will say something about their general characteristics; I will also talk a little about their sources of inspiration and how they came to be developed; I will try to describe their core components and, in most cases, also say something about structural aspects of the representation; and finally I will look into what kind of problem solving is typical for the representation and, in the end, comment on what it could mean to do machine learning in that kind of representation.

Even though I chose five subthemes for separate lectures, there are other subtopics worth mentioning. These subtopics have typically been very important for work in machine learning, and indirectly they are relevant and can come into play in machine learning; it was not really possible or relevant to make a lecture out of each of them, so what I will do now is comment shortly on them. I will start by saying something about list structures and list processing, then something about arrays, something about graph theory and graph search, a little about semantic networks, and finally something about production rule systems.

So starting with list structures and list processing, which has for a long time been the most established formal representation in artificial intelligence. The cornerstone here is to look at expressions in the form of lists, where a list can consist of a number of atoms, but the elements can also recursively be lists themselves, so typically what you end up with is a gigantic structure of nested lists. The role and interpretation of the atoms and lists in various positions in these structures are typically defined for each domain. This creates great flexibility to design any kind of system you like, but the problem is of course that the language itself lacks a semantics that can be shared across domains. The problem-solving paradigm coupled to lists is functional programming, which is the computational model used for computing over list structures in artificial intelligence; computation here is viewed as the recurrent application of mathematical functions, so you can view it as a recursive structure of functions working on a structure of lists. In this declarative programming paradigm, programming is done with expressions instead of with statements, as in typical imperative statement programming, and there is a very long history here: functional programming is based on work in the 1930s, when something called the lambda calculus was invented. The most well-known language in this family is LISP, which is the second-oldest programming language still in use, and LISP has been used extensively for all kinds of applications; many times, even if one starts with another kind of representation, a semantic network, a production rule system and so on, those representations are in the end mapped onto pure list structures and LISP-based functions. In the figure you can see an example of that, where a very simple semantic network has been expressed in a list format.
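As a sketch of that idea (with an invented vocabulary, and Python lists standing in for LISP s-expressions), a tiny semantic network and a recursive function over it could look like this:

```python
# A tiny semantic network written as nested lists, processed functionally.
network = [
    ["is-a", "canary", "bird"],
    ["is-a", "bird", "animal"],
    ["has", "bird", "wings"],
]

def isa_chain(concept, facts):
    """Walk 'is-a' links upwards, as a recursive function over the list structure."""
    for rel, child, parent in facts:
        if rel == "is-a" and child == concept:
            return [parent] + isa_chain(parent, facts)
    return []

print(isa_chain("canary", network))  # ['bird', 'animal']
```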
So now let's look at one form of representation that is very basic for most programming paradigms and still also very extensively used in many machine learning applications, and this form of representation is arrays.

An array is a well-known phenomenon: it is an ordered n-dimensional collection of items, where the items or elements can be identified by an index or, synonymously, by a key. Many times the word table is used synonymously with array, and if you conceptually have what you think of as a table, it is very straightforward to map that onto some kind of array. Then there are some related concepts: the term vector is used to refer to a one-dimensional array, although tuple would be a more mathematically correct word, because a vector is an entity of its own in mathematics. In the same way, a matrix can be represented as a two-dimensional grid, and one often talks about two-dimensional arrays as matrices, even though a matrix is also a separate kind of mathematical entity, really related to linear mappings. Finally, there is a certain concept that occurs in the machine learning literature, the word tensor; a tensor is also a special kind of mathematical entity, but one often talks about multi-dimensional arrays of three, four, five or more dimensions as tensors. I would say that for many of the representations we will talk about in the coming lectures, when the computation should take place, these representations are in the end mapped onto some array structure, so that the basic computation happens in the form of array processing. The second slide is just an illustration of this, showing in graphical form what is typically meant in the literature by a vector, a matrix and a tensor, in the sense of being different variants of an array with various dimensions.
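In code, assuming numpy as the array library, the terminology maps directly onto the number of dimensions:

```python
import numpy as np

vector = np.array([1.0, 2.0, 3.0])              # 1-dimensional array ("vector"/tuple)
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])     # 2-dimensional array ("matrix"/table)
tensor = np.zeros((2, 3, 4))                    # 3-dimensional array ("tensor")

print(vector.ndim, matrix.ndim, tensor.ndim)    # 1 2 3
print(matrix[0, 1])                             # elements are addressed by indices/keys
```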

So now let's turn to another very important topic. In mathematics, graph theory is the study of graphs, which are mathematical structures used to model pairwise relations between objects. A graph in this context is made up of vertices, or nodes, which are connected by edges, or arcs. Graphs are a very powerful tool for modelling, and also a basis for various kinds of implementations, both for problem solving and machine learning. So knowing something about the terminology and properties of graphs is background knowledge that makes a lot of sense and is virtually invaluable for work in this field. Apart from the basic structure there are some distinctions: graphs can be undirected or directed, where directed means that each edge has a direction; there can be numerical measures, typically called weights, coupled to the edges or vertices; there are issues of connectivity, that is, whether there are paths between specific pairs or subsets of nodes; there is the issue of whether there are cycles in a directed graph or not; and there are more specific concepts like bipartite graphs and similar distinctions. To know something about these issues and the terminology related to graphs is very important, because they are commonly referred to in all kinds of contexts that we will go into. Coupled to graph theory, you find on this slide a list of so-called classical graph-theoretical problems, ranging from the old problem of the seven bridges of Königsberg, which can of course easily be mapped onto a simple graph, over the travelling salesman problem and various problems where you have to find paths with certain properties (for example, a path that visits each vertex exactly once, or a path that uses every edge exactly once), to problems where you have to find subgraphs, such as spanning trees, that connect all the vertices. These problems may be known to you, and you may not view them as directly relevant for either artificial intelligence or machine learning, but indirectly, in many problem-solving situations, variants of these issues will occur, so to say, in disguise, because many times when you model your problem, whether you express yourself in graph terminology or not, you essentially handle your problem as if it were a graph-theoretical problem, and therefore the use of graph theory will occur.
Closely related to graph theory is graph search, and graph search has from the start been a core methodology in artificial intelligence; of course machine learning, as a subarea, inherits the dependence on these techniques. Many times problem solving will take the form of search, learning will take the form of search, and therefore the vast repertoire of search strategies that has been developed is of relevance. On this slide you find an overview of the kinds of strategies developed historically. I will not go into great detail, because this is not the main topic of this course, but obviously, when we come to the algorithmic parts later in the course, references to search techniques will be imminent. Essentially, as you can see, there are three categories of search strategies: one category is called brute force, which essentially means that the strategies are uninformed; the second group is here called informed, which means that the strategies are guided by some domain knowledge or heuristic knowledge; and the third category is termed local search. Within each of these main categories you have specific types of algorithms, so my recommendation is that you look into some of these algorithms in detail yourself, if you do not already have good knowledge about these issues, because there will be a lot of references to them coming up.
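As one small, hedged example from the brute-force (uninformed) category, here is a breadth-first search over an invented graph stored as an adjacency mapping:

```python
from collections import deque

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"], "D": ["F"], "E": ["F"], "F": []}

def bfs_path(start, goal):
    """Return a path with the fewest edges from start to goal, or None."""
    frontier = deque([[start]])   # queue of partial paths to extend
    visited = {start}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        for neighbour in graph[path[-1]]:
            if neighbour not in visited:
                visited.add(neighbour)
                frontier.append(path + [neighbour])
    return None

print(bfs_path("A", "F"))  # e.g. ['A', 'B', 'D', 'F']
```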

So now we leave the more formal matters and look into two kinds of representations that have been used a lot for practical problem solving historically, and one of them is semantic networks, or synonymously, conceptual graphs. A semantic network is a graphical formalism that represents semantic relations between concepts: it is a directed or undirected graph consisting of vertices, which represent concepts, and edges, which represent semantic relations between the concepts. The edges can be labelled in many ways, as you will see on a coming slide, but there is a small set of standard edge types that occur frequently in many of the cases developed. IS-A is a kind of relation which corresponds to what was described in an earlier lecture as a generalization relation in a taxonomy. HAS is a standard attribute or feature relation, expressing for example that an object has a property. PART-OF is what you could call a component relation, which means that one kind of entity is physically part of some other entity. Finally, there are certain kinds of causal relations related to actions or events. A drawback with semantic networks is that there is no agreed-upon notion of what a given representation structure means, no formal semantics, as there is in logic. Here you can draw a parallel to the list representations: in lists, too, everything is decided from case to case, so if you go from one domain to another you will find that syntactically it looks the same, but the intention of what is in the representation is domain-specific, and therefore it is much more difficult to compare what is done in different cases. There are also ways of extending the basic semantic networks, so that an edge not only relates two concepts, but an edge can be a network in itself; this means that you can express things like 'a person believes something', where that something is not just a simple fact but a more complex situation, and that complex situation can in turn be represented by a network.

On this slide you see a few examples of semantic networks; please have a look, and the intention is of course to give you a flavour of how things are expressed in this kind of formalism. A few general messages could be that, as was said on the earlier slide, on one hand there are a few standard relations that occur in many cases, like generalization relations, attribute relations, part-of relations and so on, while there is also a bunch of relations, or edge labels, that are totally domain-specific and tailored to the current situation. It should also be obvious to you that if one wants to work systematically and professionally with this kind of representation, graph-theoretical issues clearly come into play.
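As a hedged sketch (the facts are invented), a semantic network with the standard edge types can be written as a set of labelled triples, and property inheritance along IS-A links then becomes a simple traversal:

```python
edges = [
    ("sparrow", "IS-A", "bird"),
    ("bird", "IS-A", "animal"),
    ("bird", "HAS", "feathers"),
    ("wing", "PART-OF", "bird"),
]

def inherited_properties(concept):
    """Collect HAS-properties along the IS-A generalization chain."""
    props = []
    while concept is not None:
        props += [o for s, r, o in edges if s == concept and r == "HAS"]
        parents = [o for s, r, o in edges if s == concept and r == "IS-A"]
        concept = parents[0] if parents else None
    return props

print(inherited_properties("sparrow"))  # ['feathers'], inherited via IS-A
```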

So finally let's turn to something termed rule-based systems. Rule-based systems are a kind of system that has been used for a long time in artificial intelligence, for example to implement what were earlier called expert systems, and there are several synonymous names: rule systems, IF-THEN rule systems, production systems and so on. The idea with rule-based systems is to try to separate declarative knowledge from procedural knowledge; the rules specify each possible inference type in a declarative way, and normally there is a very simple syntax: there is an IF part with a number of conditions, and then there is a THEN part with a list of actions to perform if the whole list of conditions is satisfied. Rule-based systems share a problem with semantic networks and also list representations: there is no well-defined semantics for what is actually intended. The syntax for a rule is very simple, so it is up to you, when you design your own rule-based system, to define what you mean by a rule, the intention of a rule. It could be that the IF part looks at a situation and has conditions that evaluate that situation, while the THEN part is a set of actions acting on that situation; but it could also be that the IF-THEN construct models something closer to logic, where the IF part is a premise part and the THEN part is a conclusion part, and so on. All rule-based systems share some core parts: there is a collection of facts, and the collection of facts, you could say, represents the problem statement; then there is a collection of rules, called the knowledge base, which can act upon the facts. The computation is a cycle, where in each step of the cycle the collection of rules potentially works on the facts; some rules become relevant because their conditions are satisfied, and then these rules fire; and there is an inference engine that handles this matching process, deciding which rules apply in a certain step and carrying out the relevant actions. There are many different kinds of systems, and you can actually use rules both in a forward-chaining and a backward-chaining manner: forward means that you start with evidence and try to infer consequences, while in backward chaining you look at consequences and try to infer what reasons or evidence are most likely to have led to those consequences. Of course, there always has to be some criterion for when we have reached a satisfactory solution.
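The following bare-bones sketch (with invented facts and rules, in a diagnosis spirit) shows the fact base, rule base and forward-chaining cycle:

```python
# Fact base: represents the problem statement.
facts = {"has_fever", "has_rash"}

# Rule base: IF all conditions hold THEN add the conclusion to the fact base.
rules = [
    ({"has_fever", "has_rash"}, "suspect_measles"),
    ({"suspect_measles"}, "recommend_doctor_visit"),
]

# Inference engine: one matching cycle per iteration, until nothing new fires.
changed = True
while changed:
    changed = False
    for conditions, conclusion in rules:
        if conditions <= facts and conclusion not in facts:
            facts.add(conclusion)   # the rule "fires"
            changed = True

print(facts)  # includes 'suspect_measles' and 'recommend_doctor_visit'
```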

On this slide you will find a graphical depiction of what can go on in a rule-based system. Actually, this kind of picture is not only relevant for production systems but for most of the systems we look at, because what is included in the picture is, as you see at the top, the relation to an external situation, where the system interacts with its environment, but also, at the bottom, a graphical model of how the development of the system can take place. Of course, none of the systems we consider here would be useful if there were not a user interface through which the behaviour of the system can be made explainable, and we also need a kind of developer interface where the users or humans involved can add new knowledge; not everything can be learned, so even if you have a powerful learning mechanism, you have to incrementally add new knowledge to the system in order for it to be useful. The intention of this slide is only to give you a flavour of what a rule-based system can look like; this is actually an example of a rule-based system for diagnosis, so please have a look, and hopefully you get a good picture of how things are typically done in this kind of representation. This was the end of this lecture; thank you very much for your attention. The next lecture will be on the topic of decision trees. Thank you and good bye.

Machine Learning,ML
Prof. Carl Gustaf Jansson
Prof. Henrik Boström
Prof. Fredrik Kilander
Department of Computer Science and Engineering
KTH Royal Institute of Technology, Sweden

Lecture 12
Decision Trees

Welcome to the second lecture of the third week of the course in machine learning; this lecture will be about decision trees. Let's start with some general characteristics of this representation. Decision trees are something that was originally developed in decision analysis, where a tree can be used to visually and explicitly represent decision alternatives and the making of choices among options in various situations, and typically a decision tree is drawn upside down, with its root at the top. Nodes in the decision tree represent features; edges represent feature values or feature intervals, which can embody the decision options; leaves represent either discrete values, typically discrete situations or classes, but it can also be that the leaves represent continuous outcomes. In the discrete case we talk more or less about classification of entities and situations, while in the second case we talk about regression situations. In the decision-making scenario, decision trees are typically used in a purely normative or prescriptive mode: one sets up a tree that is a guide for how to behave, and the decision tree is then simply stipulated. If you take the simple example to the right, it is like a manual for how to act: you wake up in the morning and look to see whether it is raining; if it is not raining, you do not do anything special; if it is raining, you check whether it is also windy; if it is not windy, you can bring an umbrella, and so on. So it is a little manual for how to decide. The setting is slightly different in machine learning, because when we work with decision trees in machine learning, we essentially want to build up a relevant tree from data sets collected in the domain; the tree is not predefined, it is not prescriptive, it should be more or less true to the data considered. The challenge here is to design an optimal tree, so that the tree makes the best fit of the considered data items and has the best predictive performance for new data items.

So the interdisciplinary source of inspiration for this representation, as has already been mentioned, is decision-making theory in economics and business, which has for a long time used simple models of this kind. It has also been said that these models are normally predefined and prescriptive. But there is an interesting parallel, because when we try to build this kind of tree from data sets, with the aim of having a true picture close to the data, we can hope that even when a person or some persons simply define such a tree, the way they do it is indirectly and informally based on earlier experiences; this is the normal case. So at some point learning takes place, whether it is done informally among the people involved or formally in the form of machine learning. The core components and the core problem solving for this representation are as follows. When you construct the tree, you construct it upside down, with the root at the top. If you do this manually, you intuitively choose some order of features, represent the features as nodes, and from each node you branch the tree by looking at the values, so that the edges on every level represent feature values or feature intervals. Finally, when you come to a leaf, that leaf represents the discrete or continuous outcome. That is the build-up of the tree. The use of the tree is also straightforward: you start from the top, evaluate the values of the features in the given order, and eventually end up in a leaf with a unique outcome.
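To make this use of a tree concrete, here is a minimal Python sketch, not taken from the slides: the umbrella manual is encoded as nested nodes, and a decision is made by evaluating one feature per level from the root down. The node structure, feature names and outcomes are illustrative assumptions.

# A minimal sketch (not from the slides): a decision tree as nested dicts.
# Internal nodes name a feature and map each feature value to a subtree;
# leaves are plain outcome strings.
tree = {
    "feature": "raining",
    "branches": {
        "no": "do nothing special",
        "yes": {
            "feature": "windy",
            "branches": {"no": "bring umbrella", "yes": "stay inside"},
        },
    },
}

def decide(node, example):
    # Walk from the root to a leaf by evaluating one feature per level.
    while isinstance(node, dict):          # stop when we reach a leaf
        value = example[node["feature"]]   # evaluate the feature
        node = node["branches"][value]     # follow the matching edge
    return node

print(decide(tree, {"raining": "yes", "windy": "no"}))  # -> bring umbrella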

If we now turn to learning for this representation, which means that we want to build up a decision tree for a realistic domain with many features and many data items, we consider two kinds of decision tree analysis: classification tree analysis, where the leaves are the classes or categories that we want to define, and regression tree analysis, where the leaves are intervals or real numbers. The challenge is to design a tree that optimizes the fit to the considered data items and the predictive performance for still unseen data items; minimal prediction error is aimed at. This is not trivial, because we may have many features to look at, and for most domains it is not crystal clear which is the most important or discriminatory feature to use first; the idea is of course to start building the tree using the features that best discriminate among the data. There are several approaches that can be combined into learning techniques. One kind of technique is proactive, which means that at every point we evaluate the whole data set with respect to a potential selection of a feature and analyse the discriminatory power of that feature, using different kinds of approaches. Typically the evaluation looks at some information-theoretic measure on the whole situation; one such approach is referred to as information gain, based on an entropy concept, and there are also approaches using the Gini impurity measure, but the purpose in all these cases is to judge the discriminatory power of the feature. Another kind of approach is to first construct the tree and then, at a later stage in the process, prune it, and there are of course different techniques and criteria for what is a good way to prune. A third approach is to grow several trees in parallel instead of just one, so that through the design process we generate not just a tree but a forest, and further on in the process evaluate which is the most optimal tree within the forest. One criterion that is always important to keep in the back of one's head in these matters is the very well-known principle called Occam's razor: if there is a choice of structure you should always prefer the simplest choice, the simplest tree in this case.
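As a hedged illustration of the proactive evaluation just described, the following Python sketch computes entropy, Gini impurity and information gain for discrete features; the toy data and feature names are made up for the example.

from collections import Counter
from math import log2

def entropy(labels):
    # Shannon entropy of a list of class labels.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    # Gini impurity of a list of class labels.
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(examples, labels, feature):
    # Entropy reduction obtained by splitting on one discrete feature.
    n = len(labels)
    gain = entropy(labels)
    for value in {ex[feature] for ex in examples}:
        subset = [lab for ex, lab in zip(examples, labels) if ex[feature] == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

# Made-up toy data: pick the most discriminatory feature as the root.
examples = [{"windy": w, "raining": r}
            for w, r in [("y", "y"), ("y", "n"), ("n", "y"), ("n", "n")]]
labels = ["stay", "stay", "go", "go"]
print(max(("windy", "raining"), key=lambda f: information_gain(examples, labels, f)))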

If we look at what happens to the data set when we build a preliminary tree, we can observe the following. Say we have a population at the start; in this example we have 14 data items in the root. When we make the first split, using one feature, we split into three possibilities, which are the values of that feature. By doing that we also split the data set, the population, into three boxes. In the second step, when we make a second split using some other feature, we split again, and it should always hold that at every step the sum of the items in the boxes created sums up to the size of the data set, which was 14. As you see here, after the first split we divided into three boxes; in the second step we do not split the third option further but split the first two options, and altogether the five resulting leaves partition the data set as a whole. Another perspective on what is going on in this process is to compare a preliminary tree with another graphical depiction of the feature space. If, in this very simple case, we have two features and split with respect to one breaking point for each of them, the example illustrates how the constructed tree partitions a two-dimensional depiction of this little feature space: as you can see, one quarter of the area represents one option and three quarters of the area represent the other option or outcome.

I will end this lecture by showing you a few examples. The intention with the first example is to show you a complete picture, where you get a whole data set with a number of data items and a corresponding tree. Whether that tree is optimal we are not going to discuss at this point; we will come back to those issues next week, when we look into the variety of algorithms that have been designed for this kind of problem. This example is related to the analysis of election outcomes in the U.S. The features in this example are essentially the outcomes in various states, which means that the ordering of features is based on the importance of those outcomes for the final result, and as you see this has been done here in such a way that the tree is reasonably well behaved, in the sense that one of the alternatives, depicted as blue, is denser to the left, while the other alternative, the red outcomes, is sorted to the right. I give you two more examples here and will not say much about them, just a few comments. The left one is a classification example and the right one is a regression example. In the left one the classification has essentially two outcomes: a child gets a Christmas gift or a child does not get a Christmas gift; those are the only two outcomes, and as you see the outcome can depend on certain behaviours of the child. I have no further comments on the regression tree example, except that you can see here what was illustrated earlier: the data items in the data set are, on every level, distributed among the leaf nodes at that point. So this was the end of this lecture; thanks for your attention. The next lecture will be on the topic of Bayesian belief networks. Thank you, goodbye.

Machine Learning,ML
Prof. Carl Gustaf Jansson
Prof. Henrik Boström
Prof. Fredrik Kilander
Department of Computer Science and Engineering
KTH Royal Institute of Technology, Sweden

Lecture 13
Bayes (ian) Belief Networks

So welcome to the third lecture of the third week of the course on machine learning. The theme of this lecture is Bayes or Bayesian belief networks. Starting with the general characteristics of the representation: a Bayesian belief network, abbreviated BBN, has a number of synonyms; you can call it a Bayes network, a Bayes model, a belief network, a decision network, or use some lengthy elaboration like probabilistic directed acyclic graphical model. Essentially a BBN is a probabilistic graphical model representing a set of variables and their conditional dependencies, and as a structure it is a directed acyclic graph, a DAG. A BBN enables us to model and reason about uncertainty. BBNs accommodate both subjective probabilities and probabilities based on objective data, but in both cases we talk about probabilities as degrees of belief rather than probabilities based purely on frequencies. The most important use of BBNs is in revising probabilities in the light of actual observations of events. Look at the little example to the right, where you have three variables: it is raining or not, the sprinkler is on or not, and the grass is wet or not. One can say that this kind of graph represents causal dependencies: the rain can make the grass wet directly, the absence of rain can lead a person to put on the sprinkler, and when the sprinkler is on the grass gets wet. You can consider this kind of network in a forward manner, starting from the causes and looking at the consequences, but you can also use it backwards: when you observe that the grass is wet, you can infer the likelihood of certain causes, whether it is more likely that the sprinkler was on or that it was actually raining. This backward reasoning is really the main purpose of this kind of representation. One
key theorem from probability theory is called Bayes theorem. This theorem has a fundamental importance for the way we want to use Bayesian networks for problem solving. Bayes theorem concerns the situation where we have two variables A and B, where B is dependent on A, or we can say A causes B. There are three kinds of probabilities that can be considered here. First, the prior probability of A, P(A), and similarly the prior probability of B, P(B), which means considering the probability of A or B in isolation. Then we can talk about the conditional probabilities: the conditional probability of B given A, P(B|A), and vice versa the conditional probability of A given B, P(A|B). Finally, we can talk about the joint probability of A and B considered together, and the joint probability of A and B can be shown to be computable as the product of the conditional probability of B given A and the prior probability of A, that is, P(A, B) = P(B|A) * P(A). From this, Bayes theorem follows:

P(A|B) = P(B|A) * P(A) / P(B)

What is more interesting is how we can reason backwards in this kind of network. For many domains it is more likely that we have the data to support the conditional probability of B given A, that is, for the forward direction, for reasoning from cause to effect. Therefore it would be very practical to have a theorem that can infer the opposite probability, for reasoning from effect to cause, instead of directly observing it. Bayes theorem does exactly that: it says that we can infer the conditional probability of A (the cause) given B (the effect) to be exactly the conditional probability of B given A, times the prior probability of A, divided by the prior probability of B. The intuitive meaning, then, is that Bayes theorem defines how one can infer a conditional probability for a cause given the probability of a symptom. What you can see to the far right here is just a graphical, intuitive version of the kind of proof that can be given for this theorem.
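As a small numeric sketch of this, with made-up probabilities for the rain and wet-grass situation, the forward quantities P(A) and P(B|A) are assumed known and Bayes theorem inverts the direction:

# A hedged numeric sketch of Bayes theorem with made-up probabilities:
# A = "rain" (cause), B = "grass is wet" (effect).
p_a = 0.2             # prior probability of rain, P(A)
p_b_given_a = 0.9     # forward direction P(B|A), easy to estimate from data
p_b_given_not_a = 0.25

# Prior of the effect by total probability: P(B) = P(B|A)P(A) + P(B|~A)P(~A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes theorem inverts the direction: from effect back to cause.
p_a_given_b = p_b_given_a * p_a / p_b
print(f"P(rain | wet grass) = {p_a_given_b:.3f}")   # about 0.474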

So let's now look at the core components of this representation. Nodes represent variables in the Bayesian sense described earlier: they can be observable quantities, hidden variables or hypotheses. Edges represent conditional dependencies. Each node is associated with a probability function that takes as input a particular set of values for the node's parent variables and outputs the probability of the values of the variable represented by the node. If we look at the example to the right, it is the same one we looked at earlier: we have rain, sprinkler and wet grass. The prior probability of rain is given by the two probabilities for the feature values: rain can have the value true with probability 0.2 and the value false with probability 0.8. For a conditional probability, we look in this case at the probability that the sprinkler is on given rain. There are two cases: either it does not rain, rain is false, and then there is a 0.4 probability that the sprinkler is on and a 0.6 probability that it is not; or it is really raining, and then there is a much lower probability that the sprinkler is on, because you very rarely put on the sprinkler when it rains, so the probability is 0.1 in that case. These kinds of small tables, prior or conditional, are really the key information connected with each node in this kind of belief network. Depending on how many input edges there are to a node you get a larger table, because the probability function is typically described in terms of a table; the table also grows in size if there are more feature values than true and false, for example if the variable is an ordinal feature. You also see on this slide that there is a way to calculate the joint probability function from the prior probabilities and the conditional probabilities, and one can do that in a formal manner. As remarked earlier, it is more likely in a domain that we have information about all these probabilities in the forward direction, starting from causes and moving towards effects.

Let's look for a minute at a related example. You recognize three of the variables: rain, sprinkler and wet grass. What we did now was to include a fourth variable called cloudy and change the structure, so that cloudy affects sprinkler and cloudy affects rain, while, as you see, the connection between rain and sprinkler disappeared. This alternative network is a basis for discussing another phenomenon, which we call conditional independence, and I will talk about that on the next slide.

Conditional independence means that nodes that are not connected represent variables that are conditionally independent of each other. As you see in the alternative example where we changed the structure, sprinkler and rain are no longer directly related, because there is no edge between them; therefore sprinkler and rain are considered conditionally independent given cloudy. It is fairly obvious that the conditional dependency of sprinkler given cloudy is one thing, and including rain does not change anything: the probability of sprinkler given cloudy and rain is the same as the probability of sprinkler given cloudy alone. In this case the probabilities for the network can be calculated as below on this slide, and the nice thing is that we can calculate the joint probability for all variables just by following the structure of the network in a formal manner. Essentially the joint probability is the product of the prior probability of the independent variable, the conditional probabilities of the two middle variables that both depend on cloudy, and finally the conditional probability of wet grass given the two middle variables: P(C, S, R, W) = P(C) * P(S|C) * P(R|C) * P(W|S, R). This is a nice property of these networks: we can do this kind of forward reasoning just by following the structure of the network and using the probability table, the probability function, in each node. The examples shown so far are of course very small, in order to exemplify the principles. In realistic cases, and on this slide I have included a more realistic one, these kinds of networks are large and can be very big indeed; however, due to the properties of Bayesian networks that I will try to illustrate in the last few slides, they should still be scalable and manageable. But the example here is more like the one you would face in reality.
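A minimal Python sketch of this factorization follows; the conditional probability tables are made-up numbers, since the lecture does not give them for the four-variable network, but the product structure follows the network exactly.

# A sketch of the factorization for the cloudy/sprinkler/rain/wet-grass
# network: P(C,S,R,W) = P(C) * P(S|C) * P(R|C) * P(W|S,R).
# All numbers are made up.
P_C_true = 0.5
P_S_given_C = {True: 0.1, False: 0.5}   # P(Sprinkler=true | Cloudy)
P_R_given_C = {True: 0.8, False: 0.2}   # P(Rain=true | Cloudy)
P_W_given_SR = {(True, True): 0.99, (True, False): 0.90,
                (False, True): 0.90, (False, False): 0.00}  # P(Wet=true | S, R)

def bern(p_true, value):
    # Probability of a boolean value when P(true) = p_true.
    return p_true if value else 1 - p_true

def joint(c, s, r, w):
    # The joint probability read off the network structure, node by node.
    return (bern(P_C_true, c) *
            bern(P_S_given_C[c], s) *
            bern(P_R_given_C[c], r) *
            bern(P_W_given_SR[(s, r)], w))

print(joint(c=True, s=False, r=True, w=True))  # one full assignment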
Now I want to talk about a few structural issues concerning these networks. One key thing to understand is that a Bayesian network cannot include a cycle. In the left example you see a valid directed acyclic graph; even if one of the arrows seems to point upwards, this is merely a depiction issue and does not cause any harm. However, as you see in the right example, the arrow between A and C is now directed in the opposite direction, creating a cycle. Hopefully you have understood, from the way this kind of reasoning behaves, that it is strictly based on working either entirely in the forward direction or entirely in the backward direction, and it is impossible to handle this kind of cycle.

The next structural aspect I want to mention is that when you study these networks you can see patterns. There are patterns of different kinds, but a few are very basic, and here you see three examples: sequence, which is fairly obvious; convergence, where two variables cause a third; and divergence, where one variable is the cause of two others. At this point we will not do much with this, and we are not going to use the knowledge of such patterns, but when you handle big networks, and especially if you want to modify and extend networks, these structural aspects and the ability to treat these cases in a uniform fashion become more important.

If we turn to problem solving for this kind of representation, it has already been said that the main idea is to use the networks to infer probabilities of causes from the probabilities of effects, because a Bayesian network is a complete probabilistic model of the variables and relationships, describing effects in terms of causes. Inference typically aims to update the beliefs concerning the causes in the light of new evidence. Backward inference in a Bayesian network can thus be viewed as answering queries about the state of a subset of variables, the hypothesis variables or potential causes, when other variables, typically considered the evidence variables, are observed. The main vehicle for making this inference in the backward direction is the use of Bayes theorem, which states that the conditional probability of the hypothesis given some evidence is equal to the conditional probability of the evidence given the hypothesis, times the prior probability of the hypothesis, divided by the prior probability of the evidence:

P(H|E) = P(E|H) * P(H) / P(E)

So Bayes theorem enables the carrying out of one inference step backwards in the structure. But because of the homogeneity of the structure, this kind of inference can be recursively applied throughout the whole structure, using Bayes theorem in each step.
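In a small network, one simple way to carry out such backward inference is to sum the joint probability over the unobserved variables; here is a hedged sketch, reusing the joint function from the earlier sketch, for the query "how likely is rain given that the grass is wet":

from itertools import product

def p_rain_given_wet():
    # P(Rain=true | Wet=true): sum the joint over the hidden variables
    # (cloudy and sprinkler), then normalize by P(Wet=true).
    num = den = 0.0
    for c, s, r in product([True, False], repeat=3):
        p = joint(c, s, r, w=True)   # joint() from the sketch above
        den += p
        if r:
            num += p
    return num / den

print(p_rain_given_wet())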

So what does learning mean in this kind of representation? There are actually two kinds of possibilities for learning. One kind we can call parameter learning: as you understand, the basic parameters that we have are the conditional probabilities, given a fixed variable structure. We then assume that we have a fixed number of variables and a fixed number of edges connecting the nodes corresponding to the variables, and for every node we have the table that describes the probability function for the conditional probability of that node. The low-hanging fruit here is of course to be able to manipulate and enhance the quality of these tables across the network, so that it better reflects adequate decision making for the domain; that can be based on available data, and it is a fairly obvious, very standardized learning process. What is more tricky, of course, is to learn in the sense of changing the structure. On one level we can assume that we keep the same variables, which means the same nodes, but modify the set of edges among the nodes. Even more advanced is the addition of new variables. Typically it is not so likely that we want to modify the low-level evidence variables, which we can regard as the input layer of the structure, and not necessarily the most abstract hypotheses either; rather, it is more likely that we want to create some in-between variables which are non-observable and which we could call hidden. You will see that there is a parallel here to the thinking about neural networks, where we will also talk about input layers, output layers and hidden layers. So this was the end of this lecture; thanks for your attention, and the next lecture will be on the topic of neural networks. Thank you and goodbye.

Machine Learning,ML
Prof. Carl Gustaf Jansson
Prof. Henrik Boström
Prof. Fredrik Kilander
Department of Computer Science and Engineering
KTH Royal Institute of Technology, Sweden

Lecture 14
Artificial Neural Networks

So welcome to the fourth lecture of the third week of this course on machine learning; we will talk about artificial neural networks and their representation. An artificial neural network, abbreviated ANN, is a network of nodes or units, commonly called artificial neurons, connected by edges. The corresponding graph is directed, and edges typically have a weight that can be adjusted; the weight increases or decreases the strength of the connection. Artificial neurons also have a threshold, such that a signal is only sent from the neuron if the aggregate signal from all input edges crosses that threshold. Furthermore, before the signal leaves the neuron, the state of the neuron computed from the inputs is also transformed by a nonlinear function. Artificial neurons are aggregated into layers: typically units are sorted into layers, there may be many layers, and the layers may have different functions in the computation. Signals travel from the first layer, the input layer, to the last layer, which is normally considered the output layer, possibly after traversing the layers several times. Hidden layers may have loops; one variant of that is called recurrent neural networks.

Of course, the inspiration for the computational model outlined on the last slide comes from neuroscience, from the way we believe neural systems work in humans, animals and so on. The difference is that the artificial neurons are digital in nature, while a real neuron is an electrically excitable cell that receives, processes and transmits information through a combination of electrical and chemical synaptic signals. The real neuron is a very complex machinery in contrast to the rather simplified artificial neuron. A real neuron consists of a cell body called the soma; on the input side you have the dendrites, which receive the signals from other neurons, and on the output side you have something called an axon, which is the output organ, so to say, of the neuron. All of these structures are heavily branched and much more complex than in the artificial case, and an axon can also be pretty long, which means that the axon can reach a considerable distance to the next neural cell to be affected. But the basic functionality is that signals come in via the dendrites, are transformed and handled by the soma, and are then output via the axon. That is the basic neurological model. In contrast to the artificial systems, which still are of reasonable size, the real neuron systems are huge and process in parallel on a huge scale. To give you a figure, the human brain is assumed to have on the order of 100 billion neurons and a huge number of connections; a typical figure for the number of connections is 10,000 times the number of neurons, and then you must imagine the scale of this parallel activity.
Let's now be a little more concrete concerning the core components of this kind of representation and look again at how we view the architecture of a neural network unit. The unit is supposed to have a number of inputs; in the real neural world we talk about synapses, dendrites and so on, but in the artificial case we simply have a number of inputs, and these inputs handle signals coming from other units. Every such input, every such edge in the neural network, is given a weight, and these weights are key parameters for the behaviour and the performance of the network. What happens in the cell body, when it receives the inputs, is that a summation is performed: the weighted sum of the inputs, using the weight of each input channel, is computed. The cell body is also attributed with a threshold, and the negation of that threshold is sometimes called the bias. Essentially what we do is to add the negation of the threshold to the weighted sum of the inputs, and if that value is larger than zero, the unit is ready to, in quotation marks, "fire", which means to send out an output signal. But before that output signal is sent out, the output value is transformed by applying something called an activation function, which modifies the output value. In all these cases we normally talk about numerical values.

As the functionality of the ANN unit is very important for the understanding of the whole representation scheme, it is repeated here. If we have the output of a unit i, let's call it ai, this normally becomes an input to another unit j. Then i is considered the predecessor of j, and j the successor of i. Each connection is assigned a weight, in this case wij. Each node has an activation threshold; the negation of the threshold is termed the bias b, so for node j we call it bj. In the body of the unit the weighted inputs are summed together with the bias:

∑ wij * ai + bj

If this sum is greater than 0, the output of unit j is calculated as a function f of the sum, where f is the local activation function. So the criterion for firing, which means outputting a signal from the unit, is that the weighted sum plus the bias is larger than zero.

aj = f( ∑ wij * ai + bj )
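A minimal Python sketch of this unit functionality, with made-up weights and a sigmoid as the activation function f, follows; the firing criterion (sum greater than zero) is taken from the description above.

import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def unit_output(inputs, weights, bias, f=sigmoid):
    # One artificial neuron: the weighted inputs are summed together with
    # the bias, and if the sum crosses zero the unit "fires", outputting f(sum).
    s = sum(w * a for w, a in zip(weights, inputs)) + bias
    return f(s) if s > 0 else 0.0   # firing criterion from the lecture

# Made-up example: two input signals with their weights and a bias.
print(unit_output([0.5, 1.0], [0.8, -0.4], bias=0.1))   # fires, since s = 0.1 > 0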

Let's turn for a minute to the so-called activation functions. There are a number of those, and on this slide you see a couple of examples. You can observe that there are two categories. In the topmost row of examples you find step functions, something called a sigmoid function, and something called a hyperbolic tangent function; in all of these cases the output level is bounded somehow, so there is a segment where the output value is controlled by the function, but then it reaches a plateau. In the other category, the lower part, you see that the output is linear: the larger the original value from the unit, the larger the net output becomes. So these are the two kinds of functions.
Now that we have delved into the functionality of the unit, let's talk a little about structure. The structures of artificial neural networks reflect the history of the development of the area. As I mentioned already in the first week of this course, one of the key contributions, as early as 1957, was the Perceptron. Essentially the Perceptron is a very simple network with only an input layer and an output layer; the unit functionality is in the same genre as described here, but in this case there were only two layers. As you may also remember from the first week, this kind of simple structure was immediately successful but soon proved impractical, which meant that there was a gap from 1957 to almost 1985, when the area had a revival. The work of the mid-1980s introduced what are called hidden layers: while the Perceptron only had an input layer and an output layer, the early work of the 1980s added layers in between. We will use the word deep learning later; in a way, all neural networks that have more than two layers are considered deep, but in the early work, when one had maybe one or two hidden layers, that term was not in wide use. The introduction of hidden layers represented a major change, and maybe you remember a parallel from what we discussed earlier on Bayesian networks: you need an input layer, in which you encode your input data; you need an output layer, from which you can harvest the results; but you also need intermediate computing elements, and these intermediate computing elements are the hidden layers. Then of course, over time, as this technology developed and computational resources became more available for this purpose, it became possible to introduce many more levels. What is now talked about a lot, and has become very successful, is what is called the deep neural network, but essentially that term does not mean much more than that it is a system with very many levels.

Finally, there is a slide here on different versions of neural networks: you can see the Perceptron, and you can see the feed-forward networks, which have a hidden layer but not many hidden layers. What is interesting is in the middle of this figure, because there is a problem with networks that are only directed in a forward manner: it is very difficult to represent sequences of states and to design a memory function, which means that one needs to introduce loops in the network to handle that.

Let's now look at how problem solving is carried out for this kind of representation, given that you have hopefully understood the basic functionality of the unit and the structural properties of the representation. When we have a problem we, as always, define a set of features to express our examples. When we have expressed all our examples in terms of a certain set of features, and we want to use an artificial neural network representation, we have to map the features onto the units in the input layer, and depending on exactly how that looks it becomes more or less complicated, because we should understand that neural networks are purely numerical: the input layer is a discrete set of numerical items, so whatever input we have, we have to map it into this discrete set of numerical values. And as I said, if we want to handle sequences and not only a single level of input, that has to be handled by a special kind of network with internal loops, but we will come back to that later. In the same way we also have to do the same kind of modelling for the output layer, because the output layer is where we want to harvest our results, so the output from the output layer has to be decoded back into the original feature representation. The same goes for any kind of problem; it does not matter whether we try to do a classification, have a regression task, or the network is supposed to generate output actions in some system. Also, all neural network computation schemes have many so-called hyperparameters that control the detailed behaviour, for example the selection of exactly which activation function to use; there are many of these internal parameters that guide the artificial neural network machinery, and they typically need to be adjusted to fit the problem at hand. The basic flow of computation, however, is a forward-feeding machinery, starting from the input layer, going through the hidden layers, and ending up in the last, output layer.
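A minimal sketch of this forward-feeding flow in Python, for a made-up network with one hidden layer; the encoding of features into the input vector is assumed to be already done.

import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def forward(layers, x):
    # Feed an input vector through a feed-forward network.
    # `layers` is a list of (weight_matrix, bias_vector) pairs.
    for weights, biases in layers:
        x = [sigmoid(sum(w * a for w, a in zip(row, x)) + b)
             for row, b in zip(weights, biases)]
    return x

# Made-up network: 2 inputs -> 2 hidden units -> 1 output unit.
hidden = ([[0.5, -0.6], [0.3, 0.8]], [0.1, -0.1])
output = ([[1.2, -0.7]], [0.05])
print(forward([hidden, output], x=[1.0, 0.0]))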

On the following three slides I just show you a few examples. These examples are pretty standard ones; there are no examples with the Perceptron, they are all examples of the use of networks with one hidden layer, as depicted in these cases. I think the important message of all these examples is simply that the basic analysis of the domain and the features you have to consider is not different when using an artificial neural network: you still have to analyse the domain, you still have to choose the relevant feature set, and this goes for any learning approach. What you have to do here is find a way of encoding the features in a reasonable way onto the input layer. The simplest case is when all your features are zero/one valued, and then you can just have one input node for each of them, or you can simply use a numerical value; for something more complicated there is another mapping process.

The last example I want to show you has a different character. This is an example where the input is not a well-ordered digital feature set; the inputs are images. This means that if you give such a task to this kind of network, it has to solve two kinds of tasks: first, the original task, the one where the input is well engineered in digital form and where you can do a classification or regression task; but also to analyse and transform the image into digital form in a sensible way. So this kind of network not only does learning in the form we have discussed up to now, it also has to perform an image recognition task. This typically means that if you want a network to do all these things, it has to be much deeper, there has to be much more room for internal computation, and for the image recognition part the network typically has to be specially engineered, with special structural properties, to manage that task; but we will come back to that in a later week. Essentially this example illustrates the case where your inputs are images but you still want to get something semantically meaningful out of the system, which in this case is the name of the lady in the image.

Let's now turn to what learning means in this kind of representation. As for the Bayesian networks, there are, one can say, two cases. The low-hanging fruit here, the very natural learning mechanism, is the updating of the weights of the edges, which are the key parameters of this kind of system, but also, for example, the thresholds of the nodes. There could be other small things too, but the weights of the edges and the thresholds of the nodes are the key parameters that can be updated and where learning can take place. We will not delve into the learning mechanisms at this point, but one of the earliest well-known approaches is one where the outcome, the output of the network, is reviewed externally, and after that review is fed back into the system. The learning machinery is then such that the feedback fed back into the system is analysed in a way that assigns credit and blame to specific connections, and the weights of these connections can be increased or decreased depending on which were to blame and which were to be given credit. So this is one example of how learning can take place, but there are many options here, and we will come back to those. Apart from that, it is of course also possible to change the network in more dramatic ways. While in the first case the whole structure of the network is supposed to be static and only the parameters are supposed to change, we can also consider the case where we dynamically update the network: change the connections in the network, take away connections, add connections, but also introduce new nodes and new levels, which is of course the most advanced. Finally, and this is not mentioned on the slide, there have also been various attempts to actually learn the feature sets that are actively used, so learning the selection of features is, I would say, also a possibility within this second category. So this is the end of this lecture; thank you for your attention. The next lecture this week will be on the topic of genetic algorithms. Thank you, bye.

Machine Learning,ML
Prof. Carl Gustaf Jansson
Prof. Henrik Boström
Prof. Fredrik Kilander
Department of Computer Science and Engineering
KTH Royal Institute of Technology, Sweden

Lecture 17
Tutorial for week03

Welcome to this last lecture of the third week of this course on machine learning. As always, the last lecture is an introduction to the assignments for the week. This week the lectures have focused on five specific sub-themes, which constitute combinations of specific representations and computational models: we have looked at decision trees, Bayesian networks, neural networks, genetic algorithms, and finally logic programming. The assignments for this week are strictly structured according to these five themes: you will find five questions in the first group, primarily based on the video lectures, and then five other questions, also mapping directly onto the five sub-themes, where you may also need to look into the suggested extra material. I will comment shortly on the five questions in group one. One crucial issue for decision tree learning, as you may remember from the lecture, is to understand in which order to use features when building the tree, and one way of handling the problem is to create multiple trees; so the question is: what is the general term for the methods that take that approach? Question two relates not to decision trees but to Bayesian networks. One issue, if you are not a very experienced statistician or probability theorist, is to keep track of all the various terms regarding probabilities; so this question tries to brush up your mind and your memory on the various kinds of probabilities that come into play in this context. The third question concerns artificial neural networks: as I hope you remember from the lectures, one important sub-functionality of a neural network is the kind of activation function employed in the output of a unit, so the question to you here is: what exactly is the name of this kind of activation function? For question four, as you remember from that lecture, when new generations of populations in a genetic algorithm are composed, a few core operations are needed, and the question is: what is this particular kind of operation called? Finally, in logic programming, computation is implemented through a certain kind of theorem proving, and the question is: what is the name of the kind of theorem proving that is employed in logic programming?

A few comments on the additional material recommended this week. There is one article on each of the subjects. The first article is one of the classical ones, describing early work on the induction of decision trees. The next article is an example of how to apply Bayesian networks in a particular sector, and the third is also a classical paper: it is one of the papers published by Rumelhart and associates in 1985, describing one of the first approaches to learning neural network parameters by backward propagation of feedback based on the output of the network. The fourth item is essentially an overview of genetic algorithms, while the fifth paper is a fairly recent paper on inductive logic programming, the most important paradigm for learning in a logic programming framework.

There is also a second group of questions related to the five sub-areas; most of them need at least some research into the suggested additional readings. The first question relates to decision tree induction, and you will mainly need to look a little into the different categories of algorithms produced for the induction of decision trees. The second question relates to Bayesian networks, essentially about how learning can take place in such networks. The third question has to do with the central mechanism for giving credit or blame within artificial neural networks, based on feedback given by the environment as a response to actions taken on the basis of the network output. The fourth question is a basic question about the terminology of genetic algorithms, while the last question has to do with one of the types of learning that can take place within the framework of inductive logic programming. So this was the end of the last lecture for this week. Next week we will continue with the following theme, which is inductive learning based on symbolic representations and weak theories. Thank you very much for this week, bye.

Machine Learning,ML
Prof. Carl Gustaf Jansson
Prof. Henrik Boström
Prof. Fredrik Kilander
Department of Computer Science and Engineering
KTH Royal Institute of Technology, Sweden

Lecture 15
Genetic algorithm

Welcome to this fifth lecture of the third week of the machine learning course. The topic for this lecture is genetic algorithms. Genetic algorithms are a specific and early representative of a class of computational models called evolutionary computing. As for all efforts in evolutionary computing, they are inspired by theories and models from evolutionary biology. Genetic algorithms are commonly used to generate global solutions to optimization and search problems, and they are particularly useful for problem domains that have a very complex optimality landscape. The simplest form of genetic algorithm is based upon a representation of chromosomes, that is, the data items, in the form of simple binary strings, with discrete functions for evaluating the fitness of a chromosome, and syntactically defined breeding and mutation operators designed specifically for the binary strings. The idea of genetic algorithms can be engineered for much more complex representations than binary strings; that is absolutely possible, but it is not treated in this lecture.

Let's look for a moment at the interdisciplinary sources of inspiration for this representation. Genetic algorithms are inspired by Darwinian evolutionary theory and the idea of survival of the fittest by natural selection. Below you can see some pictures that illustrate the gradual organic development of the natural species. In Darwin's theory of evolution, populations change over time, and he was the first to propose a feasible mechanism for how this happens. If we now look at the core components: first of all, the data set here is called the population, and it consists of data items, which in this framework are called chromosomes, where each chromosome, each datum, represents one potential solution to the problem at hand. As said earlier, in the variant discussed in this lecture we have a very simple model where each data item is a binary string, and in the terminology used in genetic algorithms a gene is one position in such a string. A very basic operation, performed continuously and regularly, is to rank the chromosomes: every chromosome, every data item, is evaluated in every step of the process by applying a fitness function, and the idea of the whole process is to produce, in the end, the fittest chromosome, the one that has the best performance.

So what is the basic machinery of a generic genetic algorithm system? The key elements are the following. We have this population of chromosomes, and then we have a computational cycle where in each step the fitness of the whole population is evaluated. After that a selection is made: a subset of the fittest members of the population is chosen to be directly included in the next generation, and the rest are discarded. The selected ones are also allowed to mate: pairs of the remaining, fittest ones are allowed to mate to generate children and restore the population size. This mating is typically carried out through a process called crossover, which I will describe a little later. After that, one also allows a step where the new set of data items, or chromosomes, can be randomly mutated. When that has taken place we have a new generation, and a new cycle begins.

One very important ingredient in the genetic algorithm machinery is the fitness function mentioned earlier, which could also be called an objective function or evaluation function. This function looks at every data item and evaluates how close that data item is to a solution, and because it has such a crucial role, the success of an implementation of a genetic algorithm depends on a very clever choice of the fitness function. A few things are important. It is very important that it is clearly defined and understandable. It is important that it generates intuitively reasonable results; it does not need to be simple, it can be complex, but it should be easy to understand and it should create intuitive results. It should also be efficient to implement, because it will be evaluated many times: if you consider that you have a large population and run many cycles, there is a computational issue involved here. Finally, it is supposed to produce a quantitative measure, a real value, that discriminates the chromosomes as much as possible.
It was mentioned earlier that in the production of new chromosomes from parent chromosomes, a few key operations are involved: for mating between chromosomes the crossover operation is used, and then there is also a mutation operation, normally applied afterwards. Crossover needs some explanation. Single-point crossover essentially means taking two strings, two parents, choosing one point in the binary string, and swapping all the bits to the right of that point. As you see in the examples, the two parents' left parts, up to the crossover point, remain the same, while the right parts are swapped, giving two children. The alternative is multi-point crossover: in the case of two-point crossover, two points are picked randomly and the section between the two points is swapped, while the beginning and the end remain the same; this can be generalized to k points, but that is not very important here. So this is crossover. The mutation operation is very simple: it means flipping a bit at a random position.
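A minimal Python sketch of these two operators on binary strings follows; the random choices are, of course, illustrative.

import random

def single_point_crossover(parent1, parent2):
    # Choose one point and swap all bits to the right of it.
    point = random.randrange(1, len(parent1))
    return (parent1[:point] + parent2[point:],
            parent2[:point] + parent1[point:])

def mutate(chromosome, rate):
    # Flip each bit independently with a small probability.
    flip = {"0": "1", "1": "0"}
    return "".join(flip[bit] if random.random() < rate else bit
                   for bit in chromosome)

random.seed(0)
print(single_point_crossover("11110000", "00001111"))
print(mutate("11110000", rate=0.1))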
The binary-string genetic algorithm scheme is very simple, straightforward and appealing, but there are a number of drawbacks, and I want to mention a few issues. It is crucial and non-trivial how the features of problems are mapped to binary strings. Already when we talked about neural networks we realized that it is not necessarily a simple problem to take a set of features in some form and map them onto the input layer of a neural network. In the genetic algorithm setting this becomes even more tricky because, as I hope you understood, these chromosomes are changed and modified in a purely syntactic manner: the crossover operations, for example, break the binary string into parts, and it must not happen that such a crossover breaks apart bits that belong to the same feature. So the modelling of the feature set onto the chromosome binary strings is a non-trivial task. Also, the choice of fitness function is absolutely crucial; that is an issue because it is a single component in the system and its behaviour is so important. And, as I already touched upon, depending on how the feature values are modelled onto the chromosomes, one may have to restrict the crossover and mutation operators so that they cannot produce chromosomes that are meaningless from a problem-domain point of view. There are also a lot of hyperparameters, as there are for any system; I mentioned this for neural networks, but here there are many, and these typically have to be adjusted: parameters such as the size of the reproduction subset, the mutation frequency, the various crossover policies, and so on. Finally, as for neural networks, the amount of computation needed is typically immense for this kind of setup.

I have included a more detailed, though very simple, example. On this first slide you can see the setup: chromosomes consisting of eight binary digits, four data items, a fitness function that counts the number of ones, a crossover rate that guides the crossover functionality, a mutation rate, and so on. So the hyperparameters are set and the outline is done. In the next part of the example the fitness function is evaluated and normalized, then the crossover operation is performed, and then the mutation, which creates a new population. As is observed here, it does not always go locally in the right direction: it can happen that some of the fittest candidates disappear but then reoccur. Actually, one very good property of genetic algorithms is that they are not as sensitive to local optima, something many other algorithms suffer from.
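Here is a hedged sketch of the whole cycle for a setup like the one in the example, with eight-bit chromosomes and a fitness function counting ones, reusing the crossover and mutation helpers sketched earlier; the selection policy and the rates are assumptions for illustration.

import random
random.seed(1)

def fitness(chromosome):
    # The fitness function from the example: count the ones.
    return chromosome.count("1")

# Four random eight-bit chromosomes, as in the setup described above.
population = ["".join(random.choice("01") for _ in range(8)) for _ in range(4)]

for generation in range(20):
    # Evaluate the whole population and select the fittest half.
    population.sort(key=fitness, reverse=True)
    parents = population[:2]
    # Mate the survivors to restore the population size, then mutate
    # (single_point_crossover and mutate as sketched earlier).
    child1, child2 = single_point_crossover(parents[0], parents[1])
    population = parents + [mutate(child1, 0.05), mutate(child2, 0.05)]

print(max(population, key=fitness))   # the fittest chromosome found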

For your convenience I have included two other examples. There are no new aspects of particular importance; they just let you see some alternative cases. The first example is a solution to the eight queens problem; as you can understand, one issue here is to encode the positions of the eight queens on the chess board in a form that can be mapped onto a chromosome. In the second example the problem is to maximize a numerical function, which is just another kind of case.

Then the question is how genetic algorithms relate to learning, and of course there can be many possible connections here, but the one I want to mention at this point is called classifier systems. Essentially a classifier system is a production system, as described in the first lecture this week, and what one does in a classifier system is to map the production rules of a production system onto the binary strings of a genetic algorithm. You have already seen, a few slides ago, that you need to do this kind of mapping; for the eight queens problem you had to do it in some way, and of course nothing prevents you from taking the symbolic rules of a rule-based or production system and mapping them onto binary strings. This means, in a way, that you can utilize the machinery of the genetic algorithm so that the population, the data set, is exactly a number of data items or chromosomes that are equivalent to an initial rule set that is supposed to solve a certain problem. But then of course you need a separate module in the system, where you, in every generation, apply the current generation of rules to a specific problem, and typically that module acts a lot like what was described on the earlier slides on rule-based systems: essentially there is some kind of inference engine that tries out which rules match the problem at hand. You also need some reinforcement feedback in the system, so that credit and blame can be given to the individual contributing classifiers; there is actually an algorithm for that, called the bucket brigade algorithm, which is very similar to what in the neural network case is called backpropagation.

The feedback to the individual classifiers from the result of applying the rule set forms the basis for the fitness evaluation of the population; for the rest, the normal genetic algorithm machinery runs, which means that there will be a new generation of rules every time, the new rule set will be applied again to the problem, feedback will be generated, and credit and blame will be assigned to the individual classifiers, and so on. So this is one approach to combining a learning scenario for a rule-based system with the genetic algorithm setup. This slide is just a repetition of what I said a little earlier: essentially a classifier system is a combination of a genetic algorithm and a kind of rule-based, reinforcement-based system. The rule base you see in the picture constitutes the population of the genetic algorithm, but in every step of the process this rule base is applied to a certain problem task, and the result of that problem solving is fed back into the rule system again, so that credit and blame are given to the individual rules; the distribution of credit and blame over the rules is used as the basis for the fitness function, and then the genetic algorithm creates a new generation, and so on. So this was the end of the fifth lecture this week; lecture number six will be on the topic of logic programming. Thank you, goodbye.

Machine Learning,ML
Prof. Carl Gustaf Jansson
Prof. Henrik Boström
Prof. Fredrik Kilander
Department of Computer Science and Engineering
KTH Royal Institute of Technology, Sweden

Lecture 16
Logic Programming

Welcome to lecture number six of the third week of the machine learning course; the topic of this lecture is logic programming. Logic programming is an abstract model of computation. In this lecture logic programming is illustrated by the Prolog language; other dialects of logic programming exist, with slightly different properties. A logic program is an encoding of a problem in logic from which a problem solution is logically derivable, and the execution of the program can be considered a side effect of a theorem-proving process. An important concept in logic programming is the separation of programs into their logic components and their control components, as captured by the well-known slogan "Algorithm = Logic + Control". Essentially, the statements of the logic program represent the logic component, while the machinery of the Prolog system, in the case we discuss, represents the control that is used. A logic program can also be regarded as a generalized relational database, actually including both rules and facts.

Logic programming is a computational model emanating from theorem proving in predicate logic, as investigated in theoretical philosophy. The logic program is a set of axioms, and logic programming can be viewed as controlled deduction. The logic programming engine tries to find a resolution refutation of the negated query; the desired computation is a side effect of the refutation proof. The resolution method used in logic programming is called SLD resolution.
Let's look a little closer at the core components of logic programming. Logic programming is based on terms and statements; statements can be facts, rules and queries, which are all syntactically so-called Horn clauses. Facts and queries have a very similar form: they consist of a single goal (we come back to what a goal is); a fact is ended with a period and a query with a question mark, so the syntactic discrimination is very small, as you can see to the right. A fact like father(john, tom) states that John is the father of Tom, and a query like father(X, tom)? asks for an X that is the father of Tom; that is how it should be interpreted. So they are syntactically similar but have very different functions. A rule has the form A :- B1, B2, ..., Bn, where A is the head and the B's form the body, and the A and the B's are all goals.

So what is a goal then? Facts, queries and rules are statements; a key part of all of these is the goal. A goal is a compound term, and a compound term is a functor, which is an atom, that is, a constant, followed by a set of arguments that in turn are terms; a term can of course be a compound term, a variable, or an atom, which is equivalent to a constant. For example, in the little example to the right, father is a functor but also an atom; john is an atom, tom is an atom, X is a variable; grandfather, father and parent are all functors. Variables are very important here, because they are used in the matching process during the computation. Variables are universally quantified within the scope of the statement: no matter whether it is a fact or a rule, a variable mentioned there is a variable only in that limited context. So a variable is not a variable across the whole set of statements, as is often the case in many other programming settings. Also, statements of any kind that contain only atoms are called ground statements; these are essentially facts or generalizations of facts, while a statement with a variable is called non-ground. These are the basic ingredients, the basic ways of
So what I show you here is a slightly larger example of logic programming and as you can
see in the top there are facts which is essentially predicates, so the predicate male states that
John is a male, Tom is a male, George is a male, Tim is a male, and Ann is a female and
dianna is a female and so on, so all these are facts and they are supposed to be all facts that
supposed to be terminated by period. Then someone can say that this kind of all facts
corresponds to two features or attributes of objects then we have relations between objects oh
so I think you see in the next part so we have facts that gives relation between those objects
and you know in this case is their family relationships. And then follows different kinds of
rules more or less complex that expresses some of them are simple some of them express
more complex things into several other sub-samples and so on. And then in the end at the
bottom line you find the kind of queries you can put to a system which has the same form of
facts but they are all followed by question mark and they may many times involve variables.

Let us now look at the execution of a logic program; the idea here is to give you a flavour of how this computation happens. The execution is initiated by the user posting a single goal

79
called a query; you will see an example of a query shortly. The logic programming engine then tries to find a resolution refutation of the negated query. If the negated query can be refuted, it follows that the query, with the appropriate variable bindings in place, is a logical consequence of the program. In that case all generated variable bindings are reported to the user and the query is said to have succeeded. So essentially the computation is an indirect effect of the proof, and the result is the set of variable bindings. Operationally, the logic programming execution strategy can be thought of as a generalization of function calls in other languages: a query is matched against the goals in the heads of the rules, and a match results in the invocation of the right-hand side of the matching rule, so one can see this as a recursive sequence of function calls. The difference is that, in contrast to normal function calls, where it is completely deterministic which function is called, in this kind of language multiple clause heads, that is multiple rules, can match a given call. In that case the system makes a choice: it takes the first rule that matches the goal, notes that choice point, and continues. If any goal fails in the course of executing the program, all the variable bindings that were made since the most recent choice point are undone. So if everything works well you always take the first choice in every step, but if something goes wrong along the way you go back, undoing the bindings you have made, to the most recent point where a next option can be taken. This kind of execution strategy is called chronological backtracking. So essentially what the Prolog system does is to try different options, and one can see it as a search strategy through the various options that exist for this particular computation.
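To make this execution model concrete, here is a minimal Python sketch of SLD-style resolution with chronological backtracking, under simplifying assumptions: terms are tuples, variables are capitalized strings, facts are rules with empty bodies, and the father/grandfather predicates are hypothetical stand-ins for the family example above. It is a toy illustration of the strategy just described, not a real Prolog system.

def is_var(t):
    # variables are capitalized strings, e.g. "X"; constants are lowercase
    return isinstance(t, str) and t[:1].isupper()

def walk(t, s):
    # follow variable bindings in substitution s
    while is_var(t) and t in s:
        t = s[t]
    return t

def unify(a, b, s):
    # return an extended substitution, or None on failure
    a, b = walk(a, s), walk(b, s)
    if a == b:
        return s
    if is_var(a):
        return {**s, a: b}
    if is_var(b):
        return {**s, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):
            s = unify(x, y, s)
            if s is None:
                return None
        return s
    return None

def rename(term, n):
    # give each clause use fresh variable names
    if is_var(term):
        return f"{term}_{n}"
    if isinstance(term, tuple):
        return tuple(rename(t, n) for t in term)
    return term

def solve(goals, program, s, depth=0):
    # resolve goals left to right; clause heads are tried in program order
    if not goals:
        yield s
        return
    first, rest = goals[0], goals[1:]
    for head, body in program:                    # each match is a choice point
        head = rename(head, depth)
        body = [rename(g, depth) for g in body]
        s2 = unify(first, head, s)
        if s2 is not None:
            # on failure of this branch, s2 is simply discarded,
            # which plays the role of undoing bindings when backtracking
            yield from solve(body + rest, program, s2, depth + 1)

program = [
    (("father", "john", "tom"), []),              # facts: rules with empty bodies
    (("father", "tom", "bob"), []),
    (("grandfather", "X", "Z"), [("father", "X", "Y"), ("father", "Y", "Z")]),
]
query = [("grandfather", "G", "bob")]
for sol in solve(query, program, {}):
    print(walk("G", sol))                         # prints: john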
So what does learning mean for this kind of representation? There is a specific subfield of machine learning called inductive logic programming, where Prolog statements are induced from examples. The standard logic programming representation is used to uniformly represent examples, hypotheses and background knowledge. This is a big feature, actually a big advantage, because in many other learning settings you are forced to put a lot of effort into the syntax of the language in which you express your examples, the language in which you express your hypotheses, and thirdly a language for expressing background knowledge; in many cases these are different, and a lot of mappings and other considerations are needed to make the engineering work well.
So the uniform language is a big advantage. What, then, do you input to an inductive logic programming system? The input is the dataset, expressed in

80
Prolog if that is the language we look at, together with the relevant background knowledge, and this is pretty straightforward. The output is a set of program statements, created by the system, that entails all positive and no negative examples. Of course, logic programming as such does not provide learning by itself; what is argued here is that, given some applicable learning algorithm, logic programming is a very uniform and nice framework for integrating these components into a whole system.
This was the end of this lecture; thanks for your attention. We are now finished with the main lectures of this week, so the next lecture will be on the topic of the tutorial for week 3.

81
Machine Learning,ML
Prof. Carl Gustaf Jansson
Prof. Henrik Boström
Prof. Fredrik Kilander
Department of Computer Science and Engineering
KTH Royal Institute of Technology, Sweden

Lecture 18
Inductive Learning based on Symbolic Representations and Weak Theories

Welcome to the fourth week of the course in machine learning. This week has the title Inductive Learning based on Symbolic Representations and Weak Theories. First I want to make some comments on the title of this week. Inductive learning is self-evident, so why is it called learning based on symbolic representations? I hope you have already understood that there has for a long time been a balance in artificial intelligence between work based on symbolic representations and work on sub-symbolic representations. By symbolic representations are meant symbols, lists, semantic networks, Bayesian networks, production rules, logic and decision trees, many of the things that we introduced last week. In contrast to that we have sub-symbolic representations: neural networks, genetic algorithms and binary strings, which you have also heard something about already. For any part of artificial intelligence, if you look over the years, the focus of work has been on one or the other, and it has gone a little up and down. At the moment in machine learning there is a very strong focus on sub-symbolic representations, in particular neural networks, but that does not mean that all the other streams of work are dead or drying up; all the different approaches continue, but at any point in time the limelight is on one of them. What is important to say is that, independent of whether the focus is on symbolic or sub-symbolic approaches, there is also a common denominator here, namely mathematics, because both approaches can be described more or less purely in mathematical terms, and mathematics is the foundation for both. So anything regarding groups, numbers, sets, graphs, vectors, lattices and tensors is relevant under all circumstances. But essentially this week we will still focus on symbolic representations.
Let me now try to explain the reason for the last part of the title, learning with weak theories. The motivation for including that part in the title is that there are clearly two categories of work in machine learning. I would say the largest amount of work has

82
been put into what is termed here learning in the presence of weak theories, meaning creating abstractions from sets of instances with a minimal, very weak or almost absent model or background theory. Of course there is always some background knowledge; we have already talked about the fact that, for many of the methods and algorithms we are studying, there are a lot of parameters that have to be set, and we have talked about biases that can be built into the languages we use, and so on. All these biases constitute background knowledge, and we can never be entirely without it, but in many of the cases we will study this week there is very, very little such knowledge. This means that we more or less entirely learn the abstractions by generalization from the instances, but typically, in order for this to work, we normally need pretty large sets of instances. So that is the scenario we are focusing on for the moment. The opposite situation, which we will go more into in the coming week, is how you can create abstractions, or rather modify abstractions, on the basis of an existing model or background theory that is more substantial; it could be almost complete and just need to be adapted slightly, or it can be partial and need to be extended. In those cases you may have a lot of instances, but these techniques can also work with smaller sets of instances, because there is so much guiding knowledge available for steering the way the abstractions are formed. So this is the reason for this distinction; this week we will still focus on induction in a weak theory scenario.
I will now give you a very brief overview of what is going to happen this week. We will have four lectures, apart from the last tutorial lecture concerning the assignments, and these four lectures have the following themes. The first is Generalization as Search, which is a classical abstract framework for describing how generalization can be performed. Secondly we will look at learning algorithms for decision trees; decision trees were introduced last week as a formalism, and now we will look into some algorithms in that area. After that we will turn to instance-based learning; as mentioned in an earlier lecture, instance-based learning is the approach where you do not explicitly store or build up any abstractions, but rather the abstractions are implicitly defined by the way we structure our memory of instances. In all these three cases we primarily focus on supervised learning, which means that the instances we look at are all labelled, typically with a class label; we also partially look at the case where we have a numerical output, which is what we call regression, but that is still supervised. In contrast to that we will finish the week with a lecture on what is called clustering, which is one of the types of techniques that is crucial for the situation where we have unclassified input, the

83
situation which we here on the course have termed unsupervised learning. In the first of these lectures, Generalization as Search, we will introduce a very general framework introduced by Tom Mitchell many years ago. One can say this is an interesting framework because, in a very abstract way, it tries to illustrate what happens when generalization takes place, no matter in which formalism it takes place. We will spend some time looking at what representation we can use and discuss that, and then we will spend most of the time on what are termed data-driven strategies for generalization, where we start from our instances and, based on a systematic walk through all the instances, search the hypothesis space that we build up in order to find the solution; we can do that either in a depth-first manner, a breadth-first manner, or in a kind of combined approach called the version space. So this will be the first lecture. The next lecture will be about decision trees; we will study a specific category here called TDIDT, which is a nice palindrome by the way, and which means top-down induction of decision trees. We will look in more detail at a particular algorithm that has been important for this area, called the ID3 algorithm, and we will discuss a few crucial issues around the application of that algorithm. The third lecture is about instance-based learning, meaning learning techniques where we essentially store all instances but do not build up any abstractions. Essentially this lecture has two parts: the first part is about one kind of algorithm called the k-nearest-neighbour algorithm, and we will discuss the properties of this kind of algorithm. In the second part we will study some machine learning schemes that are very important today because there have been many successful applications; they are discussed here because they go very well together with the instance-based learning approach. The first of these is the linear classifier, which is a basic kind of machine learning strategy, and then a particular form of linear classifier termed support vector machines. Finally we will look into some specific techniques called kernel methods that can be applied to the linear classifier in order for that kind of approach to also be able to handle nonlinear cases. The last lecture will be about cluster analysis, cluster analysis being a technique that typically needs to be applied in the first stages of an unsupervised learning scenario, where we have to look at large sets of unclassified examples. There are many approaches in this area; it has been said that there are more than a hundred algorithms that try to solve the clustering problem. What we will do here is sort all these approaches into five categories, partition-based, hierarchical-based, density-based, grid-based and model-based, and I will try, in a pretty

84
summarized fashion, to give you a picture of the approach for each one of these categories. This was the end of this introductory lecture; thanks for your attention. We will continue with the next lecture, on the topic of generalization as search. Thank you very much.

85

Welcome to this lecture on Generalization as Search, which is the second lecture of the fourth week of this course in machine learning. For practical reasons this lecture is divided into two videos; this is part one. The outline of this lecture is as follows: we will first start with some general characterization of the aim of this approach, we will then look at the kind of language and formalism we need in order to work with this framework, and then we will look at the two main categories of strategies for doing this. One, which we can call generate and test, is essentially a top-down approach where we generate hypotheses and then search through the instance space in order to iteratively test which hypothesis is optimal. The other category is the data-driven strategies for generalization, where we start bottom-up from the instances and, by systematically traversing them, instead search what we call the hypothesis space that we have built up; we will look at three ways of searching the hypothesis space: depth-first, breadth-first and the version space approach.
The purpose of this lecture is to compare various approaches to generalization in terms of a single framework, which we term here generalization as search. Towards this end we cast the generalization problem as a search problem, and alternative methods for generalization are characterized in terms of the search strategies that they employ. This lecture is closely based on a seminal paper by Tom Mitchell from 1977 called “Version Spaces: A Candidate Elimination Approach to Rule Learning”.
The preconditions for the problem we are going to look at are the following. We have a language in which to describe the instances, which we call the instance language. We have a set of positive and negative training instances of some target generalization that we want to learn. We also have another language, the hypothesis language, whose purpose is to describe our generalizations; this hypothesis language spans what we call our hypothesis space. The hypothesis space, as we are going to construct it, is a poset (a partially ordered set) organized by a more-specific-than relation that relates more general concepts to more specific ones. Finally, we need a matching function or predicate that can test, at every point in time, whether a given instance and some of our generalizations match. What we want to determine is generalizations, within the provided hypothesis language, that are consistent with the training instances we have; a hypothesis is consistent if and only if it matches every positive instance and no negative instance in the data set. This approach depends on two constraints: first, it is assumed that the training instances contain no errors, and second, it is

86
necessary that the target generalization can be described in the hypothesis language we have designed. So we can say that this is a theoretical framework, because it cannot handle some of these practical problems, which are formulated here as constraints.
Let us start by saying something about the hypothesis language. The choice of hypothesis language has a major influence on the capability of the learning system. By choosing a generalization language, you fix the domain of generalizations which a program can describe and therefore learn. Most systems in some way use a generalization language that is biased, in the sense that the language is capable of representing only some of the possible sets of describable instances, not all. These biases cause both strengths and weaknesses: if the bias is inappropriately chosen it can prevent the system from inferring the correct generalizations, while if it is well chosen it can guide the induction, enabling inductive leaps beyond the information directly given by the instances. So there is a bad side and a good side to this kind of language bias. The choice of language also has a strong influence on the resource requirements, that is, on the complexity of the hypothesis space: if you represent hypotheses by graphs the problem can be exponential, while if you use feature vectors, for example, you can keep the complexity down to linear. Also, a language where the ordering you introduce is shallow and branchy will typically create a larger hypothesis set, in contrast to one where the ordering is not so branchy in each step but rather deep.
With the approach we have taken, the most important choice to make for the language is which relation to use to relate our hypotheses. Our choice here is a more-specific-than relation between hypotheses, and the semantics of this relation is given as follows: if we have two generalizations G1 and G2, then G1 is more specific than G2 if and only if the set of instances that G1 matches is a proper subset of the set of instances that G2 matches, considering all the instances and the matching predicate. You should note that this definition of the relation is extensional: it is based upon the instance sets that the generalizations cover. Of course, the definition of the relation also depends on the language we have and on the matching predicate. The more-specific-than relation imposes what is called a partial ordering over the generalizations in the hypothesis space, and it provides a powerful basis for organizing the search through the hypothesis space. Finally, one important fact: in order for the more-specific-than relation to be practically computable by a program, it must be possible to determine whether G1 is more specific than G2 only by looking at the descriptions of G1 and G2, without computing the sets of instances that they match. So even if the relation is defined by the instances, it is not tractable to make judgments of the relation between

87
hypotheses by enumerating instances. This requirement, that we must be able to decide on the validity of the relation on the basis of the descriptions alone, places restrictions on the way our language is formulated. Before we continue I want to say a few words about partial orderings. In mathematics, a partially ordered set formalizes the ordering of the elements of a set. A partially ordered set, or poset, consists of a set together with a binary relation indicating, for certain pairs of elements of the set, that one of the elements precedes the other in the ordering. The word partial means that only a subset of the pairs of elements are directly ordered, as you can see in the example to the right, where we look at the subsets of a three-element set {x, y, z}: the subset consisting of only the element x is not directly ordered with respect to the subset {y}, for example, while it is ordered with respect to the subsets {x, y} and {x, z}. If every pair of elements in the set is ordered, we talk about a total order instead.
Before we go on I want to introduce a simple example that I will use throughout this lecture. In this example our instances will be unordered pairs of simple objects, where each object is described by three features: shape, color and size. The feature values of shape are square, circle and triangle; the feature values of color are red, orange and yellow; and the feature values of size are large and small. So in the instance language, if we look at an example, you can see that instance one consists of two objects: a large red square and a small yellow circle. That is an instance. The hypothesis language in this case follows more or less the same syntax; the only thing we do is add a wildcard, symbolized by a question mark, whose purpose is to function as a generalization over the feature values of a specific feature.
So there is a possibility here to generalize feature values, but just in one step, using this wildcard. Let us now look at a small example with just a few hypotheses structured in a small network. We have G1, G2 and G3, with G2 more general than the others, and as you see in the example they are ordered from general to specific. You can see here that G1 is more specific than G2, and also that G3 is more specific than G2; but you can also see that, because of the way we define this network through the more-specific-than relation, G1 and G3 are not comparable generalizations, even though the instance sets of G1 and G3 intersect, because the sets belonging to these two generalizations do not contain each other.
So much for the hypothesis language and its properties; let us now look at different kinds of generalization strategies. On the one hand we have the data-driven strategies, which will be the focus of this lecture; in that case the instance base is traversed systematically and, as a consequence, the hypothesis space is searched, and when we say that we use one or the

88
other search strategy, we mean applying that search to the hypothesis space. In generate-and-test strategies we essentially start from the hypothesis space, traversing it, and test against the instance space, so it is a general-to-specific approach.
First a few words about the generate-and-test strategies. In generate and test we generate new hypotheses according to some procedure, and this procedure is typically independent of the input data set. Each generated hypothesis in the hypothesis space is tested against the entire data set available, and at every step the candidate hypothesis is either identified as an acceptable generalization, viewed as a node to be expanded further to create new hypotheses, or regarded as a dead end and pruned away. So generate-and-test strategies typically consider all data instances in the data set at each step, for each newly generated hypothesis to be tested. A property of this kind of algorithm is that, because each hypothesis is tested against all the instances and not only single instances, they are not so prone to deteriorate in the presence of noise, because the noise will just be part of the stream of all the instances. On the other hand, there is a problem: if we have a batch learning approach it is fine, but in an incremental situation it is more of a problem, because if new data comes in later in the process you may have to re-execute the whole generate-and-test procedure. Also, because the generated hypotheses are not influenced at all by the data, the search can be quite expensive.
We will now start to discuss a number of data-driven generalization strategies. Many generalization programs employ search strategies that are data-driven, which means that we build up a hypothesis space and then revise what we believe to be the current hypothesis, or current hypotheses, based on the incoming data instances. We will look at three kinds of approaches: depth-first search, breadth-first search and the version space approach, and for each of these approaches we will do four things: give a general characterization of the approach, sketch the prototypical algorithm, trace a simple example, and give some comments on that trace.
Let us start with the depth-first strategy. In this strategy we keep a single generalization, a single hypothesis, as the current best hypothesis, which we call the CBH. The search starts by taking the first instance and choosing a CBH consistent with that first instance. We then systematically test the CBH against the new training instances and alter it when needed: we can either alter it so that it becomes more restrictive, which we do if we encounter a negative observation that we do not want to be covered, or we can relax it

89
so that it covers a new positive observation. However, when we alter it we also have to look back at all the previous observations to ensure that we do not create an inconsistency with the instances already handled. Also, at every step there is not just a single way of altering the CBH consistently; there are many options, we have to choose one, and the options have to be tried one by one in some order. If it later becomes obvious that the choice we made was not the right one, we have to do some backtracking, going back to that decision point and taking a second choice. So the drawbacks of this strategy are that it is very costly at every step to check for consistency with all past training instances, and also that, if we happen to make the wrong choice at one of these decision points, we have to backtrack to reconsider the alternative choices at that decision point.
On this slide you see some pseudo code for the depth-first search strategy. I have already outlined informally how it works, and the role of the pseudo code is just to formalize this one step further. What you can see here are two parts at the beginning: one part that handles the negative case, where, as already said, when we encounter a new negative instance in this process of revising the current best hypothesis, we need to constrain the CBH so that it does not cover that new negative instance. There may be many ways of changing the CBH in that direction, so we have to consider all the earlier instances so that we do not create an inconsistency. Similarly, in the positive case we normally need to extend the CBH, to generalize it so that it also covers the new instance, but when we do that we also have many choices, so we again need to reconsider all the earlier instances so that we do not extend it so much that it covers some earlier negative instance. Finally, in the pseudo code you can see the section where we come into a situation where we discover that we cannot find a suitable revision, which means that we probably made the wrong choice at one of the earlier decision points, and therefore we need to backtrack to an earlier decision point. Backtracking can take many forms; what we talk about here is the simplest form, called chronological backtracking: essentially we go back to the nearest decision point and find a new alternative choice there, and of course this can be recursive because we have many levels of decisions. So, as already said, this can be a pretty costly procedure.
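As a hedged illustration, here is a high-level Python sketch of this depth-first, current-best-hypothesis procedure, with chronological backtracking implemented by recursion. The functions matches(h, x) and revisions(h, x, positive) are assumed to be supplied for the hypothesis language at hand; they are placeholders for this sketch, not part of the original formulation.

def consistent(h, seen, matches):
    # h must match every positive and no negative instance seen so far
    return all(matches(h, x) == positive for x, positive in seen)

def depth_first_cbh(instances, initial, revisions, matches):
    def search(cbh, remaining, seen):
        if not remaining:
            return cbh                                  # all instances handled
        (x, positive), rest = remaining[0], remaining[1:]
        seen2 = seen + [(x, positive)]
        if matches(cbh, x) == positive:                 # CBH already consistent with x
            return search(cbh, rest, seen2)
        for candidate in revisions(cbh, x, positive):   # choice point: options in order
            if consistent(candidate, seen2, matches):   # costly check of all past instances
                result = search(candidate, rest, seen2)
                if result is not None:
                    return result                       # first workable choice is kept
        return None                                     # dead end: backtrack to the caller
    return search(initial, instances, [])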
Let us now look at an example of how to run the depth-first search, and at a trace for that example. In this example we have three instances: first two instances which are positive, and then one negative instance. What you see on this slide is the

90
current best hypothesis in three versions, representing three steps of the depth-first search generalization algorithm.
Some comments on this example. The first positive training instance leads to the initialization of the current best hypothesis to CBH1, which matches no instances other than the first positive instance. When the second positive instance is observed, CBH1 must be revised in order to match the new positive instance, and you should notice that there are many plausible revisions to CBH1 in addition to CBH2. What is not visible here is the ordering of these alternatives, but the system picks what is considered the first alternative, and so we arrive at CBH2. After these two positive examples comes a third training instance, which is negative, and being negative it conflicts with CBH2; in this case, of all the ways CBH2 could be specialized to exclude the new negative instance, none is consistent with the earlier observed positive instances already looked at. One observation we should make in order to understand the example is that every instance is an unordered pair of objects, so it does not matter in which order the two objects are placed within the instance. So, on encountering the third, negative, instance, the system must backtrack to an earlier version of the current best hypothesis, reconsidering its previous revisions to determine a CBH3 that is consistent with the new negative instance as well as with the observed positive instances.
So this is the result in the end. For practical reasons, so as not to make this video too long, we make a break here and continue the treatment of the other algorithms in the part two video. Thank you, bye.

91

Welcome back to the lecture on generalization as search, which is part of the fourth week of the machine learning course. We made a technical break in this lecture in order not to have too long video sessions, and we made the break when we were starting to discuss generalization methods. We have looked at the depth-first search algorithm, and what we are now going to do is look at the second kind of algorithm, breadth-first search. As you can see on this slide, this strategy is also termed specific-to-general, and we will come back to that label. In contrast to depth-first search, breadth-first search maintains a set of several alternative hypotheses, not only one. The current set of the most specific such hypotheses is termed S: S contains every generalization consistent with the observed instances such that there is no generalization which is both more specific than it and consistent with the observed instances. So in this method we work from the most specific hypotheses that can cover the instances seen so far, and generalize those generalizations when needed. Starting with the most specific generalizations, the search is organized to follow the branches of the partial ordering, so that progressively more general generalizations are considered each time the current set must be modified. We initialize S to the set of maximally specific generalizations consistent with the first observed positive training instance. In general, positive training instances force the set S to contain progressively more general hypotheses, while if we encounter negative instances we need to eliminate some generalizations from S, thereby pruning branches of the search which have become overly general. So this search proceeds monotonically from specific to general hypotheses. The candidate revisions of S must still be tested for consistency with the past positive and negative instances, so in this respect the algorithm shares the properties of depth-first search. One advantage of this strategy over depth-first search stems from the fact that the set S represents a threshold in the hypothesis space: generalizations more specific than this threshold are not consistent with all observed positive instances, whereas those more general than this threshold are. Let us now look at the pseudo
code for the breadth-first search strategy. We start by initializing S to the set of generalizations consistent with the first training instance. For each subsequent instance i we have two cases. If we encounter a negative instance, we need to modify the set S so that we only retain those generalizations which do not match this new negative instance. If, in the other case, we encounter a positive instance, then there are two things to do. First of all we generalize the

92
members of S which do not match i, along each branch of the partial ordering, but only to the extent required to allow them to match i; this means that we are conservative, generalizing in order to cover the new instance but not too much, so that we keep to the most specific generalizations needed. After that we also have to revise S: if there is an element of S more general than a newly created one, we should remove that element, and it could also be that an element of S now matches a previously observed negative instance, which we cannot allow, so those elements also have to be removed. So this is the pseudo code for this algorithm.
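Here is a minimal Python sketch of this S-update, again simplified (as an assumption) to single objects with the wildcard language, where the least generalization covering a hypothesis and a new positive instance is unique, since each differing feature value generalizes to '?'; in this simplified language S therefore stays a singleton, while for richer languages S can contain several incomparable members.

def matches(h, x):
    return all(hf == '?' or hf == v for hf, v in zip(h, x))

def minimal_generalization(h, instance):
    # generalize each differing feature value to '?', and no further
    return tuple(hf if hf == v else '?' for hf, v in zip(h, instance))

def update_S(S, instance, positive, past_negatives):
    if positive:
        S = {minimal_generalization(h, instance) if not matches(h, instance) else h
             for h in S}
        # discard generalizations that now cover an earlier negative instance
        return {h for h in S if not any(matches(h, n) for n in past_negatives)}
    else:
        # negative instance: retain only members that do not match it
        return {h for h in S if not matches(h, instance)}

S = {('square', 'red', 'large')}            # initialized from the first positive instance
S = update_S(S, ('square', 'red', 'small'), True, [])
print(S)                                    # {('square', 'red', '?')}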
Let us now look at the example again. We have three instances, two positive and one negative, and this allows us to look at three iterations of running the algorithm. We can then also create three versions of the set S: S1, S2 and S3. The first thing we do is instantiate S to the most specific generalization consistent with the first example. In the next step we need to revise S1 in response to the second positive instance: here S1 is generalized along each branch of the partial ordering, to the extent needed to match the new positive instance. The resulting set S2 is the set of maximally specific generalizations consistent with the two observed positive instances. Of course you should also observe here, as in the example for depth-first, that every instance is an unordered pair of objects, so the order of the two objects within the instance does not matter. The third data item is a negative training instance; in this case one of the members of S2 was found to match the negative instance, as you can see if you study the previous slide, and therefore needed to be removed when forming S3. When we look at the resulting set S3, there is no possibility of finding an acceptable specialization of the discarded generalization: no more specific revision of it is consistent with the observed positive instances. At the same time no further generalization is acceptable, since that would also match the new negative instance. So all of these things have to be taken care of by the algorithm.
Let us now turn to the third strategy, the so-called version space strategy. The version space strategy is an extension of the breadth-first search approach, and it is a combined specific-to-general and general-to-specific approach. In addition to the set S that we used in the breadth-first case, we define a set G whose members are as general as possible: G contains all hypotheses consistent with the observed instances such that there is no generalization which is both more general and consistent with the instances. The set S is handled exactly as it was in the breadth-first case. Together, the sets S and G precisely delimit what we call the version space. A generalization x is contained in the version space represented, or bounded, by S and G if and only if x is more specific than or equal to some

93
member of G and also more general than or equal to some member of S. The advantage of the version space strategy lies in the fact that the set G summarizes the information implicit in the negative instances, which bounds the acceptable level of generality of hypotheses, while the set S summarizes the information from the positive instances, which limits the acceptable level of specialization of hypotheses.
This can be depicted in a simple figure like this, where you can see S and G and, between them, the area consisting of the generalizations of the version space; upwards is more general, downwards more specific. What happens when we go through the algorithm, iteration by iteration, is that positive examples tend to move the S set upwards and negative examples tend to move the G set downwards, which means that step by step the possible hypotheses are squeezed in between S and G, where the gap becomes more and more narrow, so that in the end there should ideally remain only one optimal hypothesis.
Some more comments on the version space strategy before we go to the example. Testing whether a given generalization is consistent with all the observed instances is logically equivalent to testing whether it lies between the sets S and G in the partial ordering of generalizations. The version space method is assured to find all generalizations within the given generalization language that are consistent with the observed training instances, independent of the order of presentation of the training instances. As you may know, many machine learning algorithms depend on the order in which the training instances are examined, but in this case it does not matter. Also, the sets S and G represent the version space in an efficient manner, summarizing the information from the observed training instances, so that no training instances need to be stored for later reconsideration. As you remember, for depth-first and breadth-first there is a need, of different kinds, to store the training instances for further inspection. However, there are a few restrictions that should be noted for this technique. As for breadth-first search, this technique will only work when the more-specific-than and more-general-than relations can be computed by direct examination of the hypotheses, because, as we just said, we do not store the instances, and therefore all operations have to rely on the explicit hypothesis descriptions that we construct. In addition, this technique assumes the existence of a most general and a most specific generalization, and it may be the case that these hypotheses do not exist.
Let us now look at the pseudo code outline of the version space strategy; the important parts here are the two blocks. There is one block of actions for the case where we encounter a negative instance, and one block of actions for when we encounter a positive instance. As you see here, there is a nice symmetry between these blocks with respect to how the

94
two sets S and G are handled. If we look at the first part of each block, you can see that what we need to ensure when we see a negative instance is that we retain in S only those generalizations which do not match i, because we cannot allow any hypothesis that matches a negative instance. Symmetrically, when we see a positive instance, we must see to it that we retain in G only those generalizations that match i, because we cannot allow generalizations that do not cover one of our positive examples. Then there is a second case in each block of actions. In the negative case, we make the generalizations of G that match i more specific, but only to the extent required; essentially, when we see a negative instance we specialize hypotheses in G, but we do it in a conservative way. In the same fashion, when we see a positive instance we generalize the members of S that do not match i, also conservatively, only to the extent required to allow them to match i, and only in such ways that each remains more specific than some generalization in G. Finally there is a third case in both blocks, because after one and two are done we have to see to it that we do not have unnecessary elements in G and S: in the first block we remove from G any element that is more specific than some other element of G, because we only allow G to move down in generality level when this is motivated by the instance encountered, and in the same way we remove from S any element that is more general than some other element of S, because we are also conservative in the way we generalize S. So this is the essence of this algorithm.
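As a compact illustration, here is a Python sketch of this candidate elimination procedure for the single-object wildcard language, a simplification of the unordered-pair language of the running example. The feature domains are assumed known, and the first example is assumed positive; this is a toy version of the strategy just described, not a full implementation.

def matches(h, x):
    return all(hf == '?' or hf == v for hf, v in zip(h, x))

def more_general_or_equal(g, h):
    return all(gf == '?' or gf == hf for gf, hf in zip(g, h))

def generalize(h, x):
    # unique minimal generalization in this language
    return tuple(hf if hf == v else '?' for hf, v in zip(h, x))

def specialize(g, x, domains):
    # minimal specializations of g that can exclude the negative instance x
    out = []
    for i, gf in enumerate(g):
        if gf == '?':
            for value in domains[i]:
                if value != x[i]:
                    out.append(g[:i] + (value,) + g[i + 1:])
    return out

def candidate_elimination(examples, domains):
    S = {examples[0][0]}                    # first example assumed positive
    G = {('?',) * len(domains)}             # most general description
    for x, positive in examples[1:]:
        if positive:
            G = {g for g in G if matches(g, x)}
            S = {generalize(h, x) for h in S}
            S = {h for h in S if any(more_general_or_equal(g, h) for g in G)}
        else:
            S = {h for h in S if not matches(h, x)}
            G = {s for g in G
                   for s in ([g] if not matches(g, x) else specialize(g, x, domains))
                   if not matches(s, x)
                   and any(more_general_or_equal(s, h) for h in S)}
            # drop members of G more specific than some other member of G
            G = {g for g in G
                   if not any(g2 != g and more_general_or_equal(g2, g) for g2 in G)}
    return S, G

domains = [('square', 'circle', 'triangle'),
           ('red', 'orange', 'yellow'),
           ('large', 'small')]
examples = [(('square', 'red', 'large'), True),
            (('square', 'red', 'small'), True),
            (('circle', 'yellow', 'small'), False)]
print(candidate_elimination(examples, domains))
# S = {('square', 'red', '?')}, G = {('square', '?', '?'), ('?', 'red', '?')}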
Let us look at the example again, now for the version space strategy. We have the same example with three instances. If we look at the hypothesis space, the set S of most specific generalizations evolves exactly as in the breadth-first situation, because, as already said, the version space approach is essentially an extended, two-sided version of breadth-first search, where we look at the most specific and the most general hypotheses at the same time. So the only difference in this trace is what happens to the set G. The situation is very similar, but we now have to consider the set G. What we choose to do here is to initialize G to the most general generalization describable within the given language, which is all question marks and matches every possible instance. Because it is consistent with the two positive training examples shown in this figure, G is unaltered in the first two iterations. As already said, the set S is revised as in breadth-first search in all three iterations. The difference is that in the third iteration G2 is revised, since the negative instance reveals that its current member is overly general; the generalization in G2 is specialized along the possible branches of the partial ordering that lead

95
downwards towards a member of S3. Along each such branch it is specialized only to the extent required so that the generalization no longer matches the new negative instance; this is done in a conservative manner. The version space at this point contains the members of S3 and G3, as well as all generalizations that lie between these two sets in the partially ordered hypothesis space. Subsequent positive training instances may force S to become more general, while subsequent negative training instances may force G to become more specific. Given enough additional training instances, S and G may eventually converge to sets containing the same description. At this point the system will have converged to the only consistent generalization within the given generalization language.
I hope you have now got a feeling for the behaviour of these kinds of algorithms. I will end this lecture with a short comment on performance; in general we do not focus much on performance in this course, but I want to mention one thing, because it repeats something already said which has some importance. If you look at the little table on the slide, and specifically at the storage space, you can see that for the version space strategy, in the bottom right corner, the order of the space needed is only proportional to the number of elements in the hypothesis sets, more specifically of the order of the number of elements in S plus the number of elements in G, because those elements alone define the abstractions needed. If you move one row up you can see that for breadth-first search you need to store the most specific generalizations, but you actually also have to store all the negative instances, because at every step you have to check that the generalizations you propose are consistent with the negatives. In depth-first search you really have to store all instances, because at every step you have to revisit both the positives and the negatives. So this performance issue in a way repeats some of the differences between the three algorithms. This was the end of the second lecture of the fourth week; the next lecture, 4.3, will be on the topic of decision tree learning algorithms. Thank you and goodbye.

96

Welcome to the third lecture of the fourth week of the course in machine learning; this lecture will be about decision tree learning algorithms. For practical reasons we will break this lecture down into two parts; this is part one. The agenda for this lecture is as follows: we will first talk about decision trees in general, then focus on a specific kind of algorithms called top-down induction of decision trees algorithms, and then talk about some information-theoretic measures needed for making crucial decisions in these algorithms. Then we will have the break before part two, in which we will look into the ID3 algorithm, which is the default or prototypical algorithm of this kind. After that we will talk a little about one of the key problems for this kind of algorithm, which is overfitting, and how to manage that problem through something called pruning, and finally we will say a few words about some alternative algorithms within the same category.
The challenge here is to design a decision tree such that the tree optimizes both the fit to the considered data items and the predictive performance for still unseen data items, that is, minimal prediction error. Decision tree analysis comes in two main types: on the one hand we have classification tree analysis, where the leaves are labelled according to the k target classes included in the data set, and on the other hand regression tree analysis, where the leaves are real numbers or intervals. Decision trees represent a disjunction of conjunctions of constraints on the feature values of our instances, and a decision tree can also be seen as equivalent to a set of if-then rules; if-then rules are also often used in decision situations. In this view each branch represents one if-then rule, where the if part corresponds to the conjunction of the tests at the nodes from the root to the leaf, and the then part corresponds to the class label or numerical range of that branch.
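As a small hedged sketch, here is how the branches of a hypothetical weather tree (the feature names are borrowed from the example used later in this lecture, but this particular tree shape is only illustrative) read directly as if-then rules:

# Each root-to-leaf branch read as one if-then rule; the tree is hypothetical.
def classify(outlook, humidity, wind):
    if outlook == 'sunny' and humidity == 'high':
        return 'no'        # if outlook=sunny AND humidity=high then no
    if outlook == 'sunny' and humidity == 'normal':
        return 'yes'
    if outlook == 'overcast':
        return 'yes'       # a one-test branch gives a one-condition rule
    if outlook == 'rain' and wind == 'strong':
        return 'no'
    if outlook == 'rain' and wind == 'weak':
        return 'yes'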
Decision trees and the corresponding learning algorithms have some positive and negative properties. Looking at the positive side first, this kind of formalism is very easy for humans to interpret, and as a consequence many aspects of the algorithms are also easy to grasp and follow. It is a very compact formalism, and it is also very natural to handle irrelevant attributes; you will see that when we go into the algorithms, since one of the key things to handle is the choice of attributes. There are also reasonable ways of handling missing data and other kinds of noise, and these methods are very fast at testing time. Of course there are some restrictions. One is that the only way a decision tree can split the data items is by following the axes of the feature space, so the divisions of the feature space are actually rectangles that follow the dimensions of the feature space. Another thing is

97
that typically these algorithms are greedy, in the sense that they always try to find an optimal local decision, which means that they do not backtrack; they never go back and change earlier decisions. This means that, because they are greedy, they may not find the globally optimal tree.
Turning to learning with this representation, we will now focus on a particular category of learning techniques called top-down induction of decision trees, TDIDT. The scenario for learning here is supervised, non-incremental, data-driven learning from examples, and we have touched on all these notions earlier, so I hope you get the picture. The systems are presented with a set of instances and develop a decision tree from the top down, guided by frequency information in the examples. The trees are constructed beginning with the root of the tree and proceeding down to its leaves. The order in which instances are handled is not supposed to influence the build-up of the tree. The systems typically examine and re-examine all of the instances at many stages during learning, so all the instances have to be stored in this process. Building the tree from the top downward, the main issue is to choose and order the features that discriminate data items in an optimal way. The subtopics that we will touch upon are the following: the use of information-theoretic measures to guide the selection and ordering of features; how to avoid underfitting and overfitting by pruning the tree; the generation of several decision trees in parallel, another approach of which an example is the random forest; and finally we will say something about how some kind of inductive bias can be built into the algorithms. One property of nodes in a decision tree that is important for the rest of our discussion is the concept of purity or homogeneity. When we start the build-up of a decision tree, we start with a data set and with the root of the tree, and initially the entire data set, all training instances, is associated with the root, as you can see in the small example to the right. For every decision split we make in this process of building up the tree, based on the chosen feature and its values, the data set is partitioned and the subsets of the data set become associated with the new nodes of the split, and this is repeated recursively down to the leaves through the series of decisions we make. Impurity or homogeneity refers to the distribution of data items over the k target classes (symbolized by colors in the simple example to the right), both for the root and for each of the nodes. A lesser degree of mixing of classes at a particular node implies higher purity for that node. This means that if the instances associated with a node are of just one class, you have maximum purity, but if you have an even mix, an even balance of all the classes involved (with two classes 50/50, with three classes one-

98
third, one-third, one-third), then you have minimum purity, or maximum impurity. Sometimes people prefer to talk about homogeneity instead, but in this presentation I use the term purity. Most algorithms in this genre aim at maximizing the purity of all nodes, and for that we of course need some means of measuring the impurity, so we will now turn to a few alternative schemes for measuring purity or impurity.
As you have understood, we need some information-theoretic ways of measuring the decision tree nodes and the associated instances, in order to understand which is the optimal choice of feature to use for discrimination when building up the tree. What we will do now is look at two closely related but slightly different schemes for doing so: one is called information gain, based on entropy, and the other scheme is called the Gini approach. There are also other ways of doing this, for example something called variance reduction, but we will not go further into that. Most of these techniques are more or less equivalent; there may be slight differences when one or the other is applied in various applications, and there will also be some differences in computational efficiency, some measures being better but more computationally expensive and vice versa. For the purpose of this course, the most important thing is to understand that in order to build a decision tree according to this kind of methodology you need some kind of measure, which we will exemplify with two cases.
Let me now talk about one of the most frequently used information-theoretic measures in this kind of algorithm, often referred to as information gain with the entropy measure. There are two parts here. The first part, if we start from the top, is the information gain. We want a measure that guides us in choosing the feature with the best discriminative power at each decision point; as we said earlier, the goal is to maximize the purity of the nodes, that is, to minimize their impurity, and we want a measure that tells us which feature has the greatest discriminative power. The idea of information gain is the following: given a measure of purity at the node where we are, let us look at what happens if we take a feature and split with respect to its values, look at the subsets of instances that become attributed to each new node that we branch from the current node, and then look at what happens to the entropy if we do this split, that is, the entropy over all the new subsets associated with the split values. This is essentially the idea of information gain, but first of all we of course need a measure of purity or impurity. The measure of purity in this approach is called entropy, borrowed from information theory. For the binary classification case we have two classes, or one class with positive examples of it

99
and negative examples of it. The measure of purity is then chosen to be the negation of the probability of a positive example times the binary logarithm of that probability, plus the probability of a negative example times the binary logarithm of that probability: Entropy(S) = -p(+) log2 p(+) - p(-) log2 p(-). This is the entropy definition.

For the general case, beyond binary classification, when we have many classes, the entropy is just the negation of the sum, over all classes, of the probability of a specific class times the binary logarithm of that probability: Entropy(S) = -Σ p(c) log2 p(c). The only thing we then need is a formula for the gain of a feature at a certain point, based on the split on that feature. On the slide you can see this formula: the gain for a certain set of instances S with respect to a feature A is the entropy of the original node minus the sum of the entropies of the subsets obtained by splitting the instances with respect to the feature values of A, each weighted by the cardinality of that subset divided by the cardinality of the whole instance set: Gain(S, A) = Entropy(S) - Σ_v (|S_v| / |S|) Entropy(S_v). This is the definition of information gain based on the entropy purity measure, and it is actually the most frequently used measure of this kind in the literature. The alternative used by some people is called the Gini impurity measure, and with it comes something we call the Gini gain. As you can see on the slide, the difference is not so much in how we define the gain, because the gain is defined by the same principle no matter what kind of impurity measure we have; the difference is in how we measure the impurity of a certain set of instances. So the formulation of the Gini gain is more or less the same as that of the general information gain; the difference is that it uses the Gini impurity. The Gini impurity is a measure of the likelihood of an incorrect classification of a new instance if that new instance were randomly classified according to the distribution of class labels in the data set, and this becomes one minus the sum of the squares of the probabilities of the instance belonging to each class: Gini(S) = 1 - Σ p(c)^2. Many times the Gini method gives more or less the same result as the entropy method. As you can understand it is a little simpler to compute, because you do not have to compute logarithms. I take it up here because it is often mentioned; the message is that the choice is not so important unless you have specific knowledge and really go into the details, but I want to mention it because it occurs in the literature. The summary is that you need this kind of measure in this kind of approach, and the two described here are the two most commonly used.
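As a minimal sketch, both impurity measures and the shared gain formula can be written as follows in Python, assuming a node is summarized by a list of per-class instance counts:

from math import log2

def entropy(counts):
    # Entropy(S) = -sum p(c) * log2 p(c), over classes with nonzero count
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def gini(counts):
    # Gini(S) = 1 - sum p(c)^2
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gain(parent, children, impurity=entropy):
    # Gain(S, A) = impurity(S) - sum (|Sv|/|S|) * impurity(Sv)
    total = sum(parent)
    return impurity(parent) - sum(
        (sum(child) / total) * impurity(child) for child in children)

Passing impurity=gini to gain gives the Gini gain, illustrating that only the impurity measure differs between the two schemes.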

100
I will now introduce an example so that I can show how these measures work out. In this example we have 14 training instances, and each instance is expressed in terms of four features, outlook, temperature, humidity and wind, each of which has a few feature values; and then we have a binary classification for the task, namely whether you should perform a certain activity or not. So this is the case we have.
Let's now look at the example and how we compute the various measure. So then the first
thing to do of course is to look at the root of all potential tree for this dataset, and as you
remember there were 14 data items in the data set, nine of them were positive which means
that it's a positive outcome they're performing something like playing tennis and five
negative. So then we can compute the entropy of the root node and you can see this done here
according to the earlier given formulas. Then we illustrate on the slide how we can try to
evaluate the information gained for one specific feature let's say the wind, which has a strong
value and a weak value. So for the weak value we have to look at which instances would
belong to a node branched out based on that feature value and actually it turns out to be eight
instances in total, six of them positive two of them negative. In the other case we would have
six instances, three of them positive three of them negatives. Okay, so then when we use the
formula introduced the gain, the information gained by splitting according to the to the wind
feature the that information gain will be the entropy for the whole set that we already
calculated, minus the sums of the entropies for these two subsets that we described about
moderated by the cardinalities of the two sets divided by the original data set, so we end up
with a value of 0.048.
If we do the same thing for all the 4 features, you can see down here for Outlook, for wind for
Humidity, temperature day the highest value that we get is 0.246, which we get for outlook
and therefore according to this method, outlook becomes the preferred feature to discriminate
and the first feature that we are going to split from the root.
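Plugging the counts from this example into the helpers sketched above reproduces the slide's numbers (the per-value outlook counts, sunny 2+/3-, overcast 4+/0-, rain 3+/2-, follow the standard play-tennis distribution and are an assumption on my part):

```python
root = [9, 5]                                   # 9 positive, 5 negative
print(round(entropy(root), 3))                  # 0.94

# wind: weak -> 6+/2-, strong -> 3+/3-
print(round(gain(root, [[6, 2], [3, 3]]), 3))   # 0.048

# outlook: sunny 2+/3-, overcast 4+/0-, rain 3+/2-
print(round(gain(root, [[2, 3], [4, 0], [3, 2]]), 3))  # 0.247 (quoted as 0.246 on the slide)
```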
On this slide I exemplify exactly the same thing for the Gini impurity measure as for the entropy measure. I will not spend a lot of time on this because it is more or less equivalent to the other case: instead of calculating the entropy of S you calculate the Gini impurity of S with the other definition, and then of course you also calculate the Gini impurities instead of the entropies for the subsets of the original set based on the feature values. In the end, when you want to calculate the gain, it is a similar formula to the one used earlier; we just use the Gini impurity measure instead of the entropy, and typically we get pretty similar results with the two methods. We will now make a short break here and stop this video; we will continue to discuss learning of decision trees in part 2. Thanks a lot, bye.


Welcome back to the third lecture of the fourth week of the course in machine learning. We will continue our discussion about decision tree learning algorithms. We will spend most of the time talking about the ID3 algorithm. ID3, which is short for Iterative Dichotomiser 3, is a TDIDT (Top-Down Induction of Decision Trees) algorithm, invented by Ross Quinlan in 1986. The TDIDT algorithm returns just one single consistent hypothesis and considers all examples as a batch. This kind of algorithm employs a greedy search, which means that it performs local optimizations without backtracking through the space of all possible decision trees. Obviously it is susceptible to the usual risks of hill climbing without backtracking and, as a consequence, typically finds a tree with short path lengths, but not necessarily the best tree. This kind of algorithm selects and orders features recursively according to a statistical measure called information gain, which we described earlier in part 1 of this lecture, until each training example can be classified unambiguously. Some kind of inductive bias is built into this algorithm. On one hand it prioritizes simplicity, which means it applies Occam's razor by always choosing the simplest tree structure possible. But also, as already mentioned, it systematically prioritizes high information gain when selecting features to discriminate among instances.

I assume you all know of William of Ockham, a friar in the 14th century who studied logic and who is credited with first phrasing the principle of always choosing the simplest solution to any problem. So when there is a choice of alternatives, always use the simplest variant. This is relevant for the discussion we are having now but, as you hopefully already know, it is a very useful principle in a number of situations.

The ID3 algorithm starts with the original data set associated with a root node. On each iteration, the algorithm considers every unused feature and calculates the information gain of that feature. It then selects the feature which has the largest information gain value. The data set is then partitioned by the selected feature to produce subsets of the data, which are then associated with the branched-out nodes corresponding to the values of the chosen feature. The algorithm continues to recur on each subset, considering only features never selected before. Recursion on a subset may stop in one of these cases: (1) if every element in the subset belongs to the same class, the node is turned into a leaf node and labelled with the class of those examples; (2) if there are no examples in the subset, that is, it is an empty set, a leaf node is created and labelled with the most common class of the examples in the parent node's set; (3) if there are no more features to be selected, but the examples still do not belong to the same class, the node is made a leaf node and labelled with the most common class of the examples in the subset.

So throughout the algorithm the decision tree is constructed, with each non-terminal node representing the selected feature on which the data is split, and each terminal node representing the class label suited for the final subset of that branch. Let us now take a short look at the pseudo-code for the ID3 algorithm. I have already informally characterized how the algorithm works, but hopefully going through this pseudo-code will consolidate your understanding. The algorithm takes as input a set of instances, a set of classes and a set of features, and it returns a tree. It says 'return node' at the end, but due to the recursive functioning of the procedure, what the algorithm actually returns is a tree. It creates the root node in the first invocation and then checks whether all instances belong to the same class. If so, you have a node with maximum purity, and you return a single-node tree with the class label belonging to those instances. If the feature set is empty, there are no more features to choose from in order to split the tree; then you also have to stop the algorithm, and what you return is a single node with a label that is the most common class label in the set of instances that you have. Otherwise you still have instances, classes and features, so you select a feature by first calculating the information gain for each feature and then choosing the feature with the highest information gain. For each value of that feature, you add a new branch, and you create a new subset of the instances that satisfy that feature value. If that subset is empty, you just create a leaf node with the most common class label found among the instances. Otherwise you make a recursive call to the ID3 algorithm, using as arguments only those instances that belong to this branch; you also pass the parameter with the classes, and you include the features, but of course you have to remove the feature already used, because in this algorithm a feature is never reused in the process. Finally the algorithm stops by returning the node. So that's the functionality.
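To make the walkthrough concrete, here is a compact sketch of ID3 in Python, following the pseudo-code just described; it reuses the gain() helper sketched earlier in part 1, and all names and the instance representation (a dict of feature values plus a "label" key) are my own choices, not from the slides:

```python
from collections import Counter

def most_common_label(instances):
    return Counter(x["label"] for x in instances).most_common(1)[0][0]

def id3(instances, features):
    """Return a decision tree: a bare class label for a leaf, or a
    (feature, {value: subtree}, majority_label) triple for an internal node."""
    labels = {x["label"] for x in instances}
    if len(labels) == 1:              # all instances in the same class: maximum purity
        return labels.pop()
    if not features:                  # no features left to split on
        return most_common_label(instances)

    def counts(insts):
        return list(Counter(x["label"] for x in insts).values())

    def gain_of(f):
        values = {x[f] for x in instances}
        return gain(counts(instances),
                    [counts([x for x in instances if x[f] == v]) for v in values])

    best = max(features, key=gain_of)               # highest information gain
    remaining = [f for f in features if f != best]  # a feature is never reused
    branches = {v: id3([x for x in instances if x[best] == v], remaining)
                for v in {x[best] for x in instances}}
    return (best, branches, most_common_label(instances))
```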

Let us now revisit the simple example that we introduced in part 1 of this lecture. This was the example where we looked at 14 instances characterized by four features: outlook, wind, humidity and temperature. This slide depicts the kind of decision tree produced by feeding those examples to the ID3 algorithm. As you see, we had 14 data items to start with, 9 positive and 5 negative. The information gain test decided that outlook was the most relevant feature to use in the first place to discriminate among the instances, and recursively down the tree humidity was chosen, and eventually wind. As you can see, given that the features are considered in that order, the dataset got partitioned so that five data items were associated with the sunny outlook case, four with the overcast case and five with the rain case. We ended up with five leaves in this tree, and it turns out that in this case only three of the four features were used to produce the decision tree. I should also comment on the phenomenon of noise. Non-systematic errors in the values of features or class labels are usually referred to as noise. Typically, two modifications of the basic algorithm are required if the tree building should be able to operate with a noise-affected training set. First, the algorithm must be able to work with inadequate features, because noise can cause even the most comprehensive set of features to appear inadequate. Secondly, the algorithm must be able to detect when testing further features will not improve the predictive accuracy of the decision tree but rather result in overfitting, and must as a consequence be able to take some countermeasure, like pruning.

Now we turn to the problem of overfitting. One serious problem for decision trees is the risk of overfitting, a practically significant difficulty for decision trees as for many other predictive models. Overfitting happens when the learning algorithm continues to develop hypotheses that reduce the training set error at the cost of an increased test set error. Let's introduce a few definitions: let the average error of a hypothesis on the training data be called E_T, and the corresponding average error over the entire distribution of data, training data included, be called E_D. We then define overfitting in the following way: a hypothesis h is said to overfit the training data if there exists some other hypothesis h' such that E_T(h) < E_T(h'), that is, h has the lower training error, but E_D(h) > E_D(h'), that is, h has the higher error on the full distribution. This means that there are other hypotheses that are better with respect to accuracy on unseen test data. You can see this in the example to the right, where the x-axis shows the size of the tree: as the tree grows, the accuracy becomes better for the training data, but it actually goes down for the test data.

For decision trees the most important approach to handling overfitting is pruning. The idea of pruning is, of course, to reduce the size of the decision tree without reducing predictive accuracy as measured by a test set or cross-validation set. There are two kinds of pruning. On one hand we have pre-pruning, where you stop growing the tree early, before it perfectly classifies the training data set, at the point where no further data split is statistically significant. Criteria for stopping are usually based on statistical tests like the chi-square test, and the aim of such a test is to give grounds for deciding whether to expand a particular node or not. The problem is that even though those tests exist, this is still not a trivial problem, so the risk of stopping too early is imminent. The alternative, which is essentially a much more popular approach, is called post-pruning, where we allow the tree to grow until it perfectly classifies the training set, and then afterwards we post-prune the tree by removal of subtrees. A common approach here is to set aside some part of the data set, which we could call a validation set, and use that validation set as a basis for the decisions in the post-pruning phase. One simple variant of post-pruning is called reduced error pruning, and the scheme for that looks roughly as follows. The data set is split into a training set and a validation set, and the validation set is used as a basis for the pruning procedure. All nodes are iteratively considered for pruning. A node is removed if the resulting tree performs no worse than the original on the validation set. Pruning means removing not only the node but the whole subtree of which the node is the root, making it a leaf and assigning it the most common class of the associated instances. Pruning continues until further pruning would deteriorate accuracy. This is a pretty simple procedure, but it fairly well reflects many schemes for post-pruning.
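A rough sketch of reduced error pruning over the tree representation used in the ID3 sketch above (internal nodes store their majority class as a third element; all helper names are again my own):

```python
def classify(tree, x):
    """Walk down the tree; a leaf is a bare class label."""
    while isinstance(tree, tuple):
        feature, branches, majority = tree
        tree = branches.get(x[feature], majority)  # fall back on majority for unseen values
    return tree

def accuracy(tree, data):
    return sum(classify(tree, x) == x["label"] for x in data) / len(data)

def reduced_error_prune(tree, val):
    """Bottom-up pruning: collapse a subtree to its majority-class leaf
    whenever that does not hurt accuracy on the validation instances
    `val` that reach this node (equivalent to not hurting whole-tree
    validation accuracy)."""
    if not isinstance(tree, tuple) or not val:
        return tree
    feature, branches, majority = tree
    pruned = {v: reduced_error_prune(t, [x for x in val if x[feature] == v])
              for v, t in branches.items()}
    candidate = (feature, pruned, majority)
    if accuracy(majority, val) >= accuracy(candidate, val):
        return majority           # the simple leaf does at least as well: prune
    return candidate
```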
Before we leave top-down induction of decision tree algorithms, I want to say something about some other algorithms. ID3, I would say, is the prototypical algorithm, and therefore I have been using it for exemplification; however, there were much earlier systems of this kind, one of them called CLS, and ID3 is actually an extension of CLS. Quinlan, who developed ID3, followed up this work, and there are more current versions in practical use called C4.5 and C5.0; C4.5 was at one time considered the default machine learning algorithm, that is, a very popular tool to use for machine learning. Then there are others: ACLS, ASSISTANT, CART, etc. So there are many alternatives. In general one can say that the later systems extend ID3 in various ways, but the primary ones are that they extend the kinds of data types that are allowed for the features, because ID3 was pretty restrictive here; also, the later algorithms are much better with respect to pruning, and they are much better with respect to noise handling.

To further emphasize what I said on the last slide, you can see on this slide a subset of these systems compared. Let's just look at ID3 and C4.5. You can see that C4.5 offers better performance than ID3 on several points: it can handle a wider range of feature values, it can handle missing values, it can also handle noise, and it has more sophisticated pruning techniques. Still there are certain things that are not handled in C4.5, like the case of outliers, which is also a common phenomenon, but as you can see another system, CART, has a mechanism for handling that. So far we have talked primarily about approaches which have the purpose of building one single decision tree. As you remember, most of these methods perform in a greedy fashion, which means that they take local decisions to form the tree and therefore cannot be guaranteed to result in a globally optimal tree. So there is another category of approaches called ensemble approaches, where ensemble methods construct more than one decision tree and use the whole set of trees for the classification. There are two kinds of approaches relevant here, which are relevant not only for decision trees but for many kinds of classifiers. We have boosting approaches, where boosting means that we take a sequential approach: a sequence of average-performing classifiers, say a first decision tree, a second decision tree, a third decision tree, can give a boosted performance by feeding experience from one classifier to the next, so that we use experience from the first to improve the performance of the second. One example of such a system is called AdaBoost, which is applicable not only to decision trees but to many ML algorithms. The other variant is called bagging approaches, where bagging is a parallel approach: a set of classifiers together produce partial results fully in parallel, which can then be used to form the basis for a total, negotiated result. One example of that is the Random Forest algorithm, which combines random decision trees with the bagging methodology to achieve very high classification accuracy. To take a concrete example of an ensemble method, we can look at random forests, or random decision forests, which are techniques that can be used for classification, regression and other tasks. Random forests operate by constructing a multitude of decision trees at training time and output the class that is the most common among the classes (or the mean prediction, in the regression case) produced by the individual trees. The random forest approach is also an alternative remedy for the decision tree problem of overfitting. To the left you can see illustrated how the outcomes of the individual trees are used as the basis for a kind of majority voting to decide upon a final class or a final regression value, and to the right you can see that in many cases the accuracy becomes better: when using the random forest approach you can get a much smoother result that is closer to the desired output.
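For orientation, this is roughly how a random forest is used in practice, here with scikit-learn (an assumed environment, not a tool mentioned in the lecture; the toy data is made up):

```python
from sklearn.ensemble import RandomForestClassifier

# toy data: feature vectors and binary labels
X = [[0, 0], [1, 1], [1, 0], [0, 1], [2, 2], [2, 1]]
y = [0, 1, 1, 0, 1, 1]

# 100 randomized trees, combined by majority vote
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
print(clf.predict([[1, 2]]))
```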

So this was the end of lecture 4.3, and thanks for your attention. The next lecture, 4.4, will be on the topic of instance-based learning. Thank you and goodbye.


Welcome to the fourth lecture of the fourth week of the course in machine learning. In this lecture we will talk about instance-based learning; for practical reasons this lecture is divided into two videos, and this is part one. The subtopics of this lecture are as follows. First I will talk in general about instance-based learning, what it is, and say a few words about the character of the so-called instance space. Then we look in more detail at a particular algorithm called the k-nearest neighbor algorithm and, in conjunction with that, discuss an important aspect of this kind of system: distance and similarity metrics. Then we will expand and look at a more general k-nearest neighbor algorithm called the weighted nearest neighbor. After that we will go into a somewhat different area and talk about support vector machines, which are a kind of binary linear classifier system, and I will talk about something called kernel methods, which are useful in order to make machine learning algorithms that basically handle the linear case also able to handle nonlinear cases.

Instance-based learning is a family of learning algorithms that, instead of performing explicit generalization, compare new problem instances with instances seen in training, which have been stored in memory. As a consequence, this kind of technique is also often called memory-based learning. The reason for the name instance-based learning is that this kind of technique bases its evaluations directly on the training instances themselves, and does not explicitly build up hypotheses. One can say that instance-based learning is a kind of lazy learning, where the evaluation is only approximated locally and all computation is deferred until classification: there is obviously no computation needed to build a hypothesis. In the worst case the hypothesis is the list of n training items, so the computational complexity of classifying a single new item is at least of the order of n. One advantage that instance-based learning has over other machine learning methods is its flexibility to adapt its model to previously unseen data, because the character of its classification changes as the memory of instances is built up. An instance-based learner may store new instances or throw away old instances; this depends, of course, on the details of the algorithm. In many machine learning approaches the internal structure of the instance space is not explicitly considered. However, the character of the instance space is always important: it always implicitly influences the performance of learning algorithms, even when it is not explicitly considered in the algorithm design. In contrast, for instance-based learning the character of the instance space is of key importance. In earlier lectures a few crucial aspects of the instance space have been mentioned: the number of features, the value sets of the features, and instances of special status (we talked about prototypes, outliers and near misses), which are all structural aspects. We also mentioned similarity and distance measures, which are a key issue here, but we will now also talk about structural properties of the whole space, such as sparseness, density, etc. These aspects will now all come into play.
I will now concentrate the discussion on one particular kind of instance-based learning algorithm called the k-nearest neighbor algorithm, kNN. In this algorithm, the analysis is based on the k closest training examples in the instance space, where closest means closest to the query item. k is a predefined positive integer that has to be set before you use this kind of method; normally one chooses k to be small and odd. Potentially, an optimal k can be calculated by special hyperparameter optimization techniques. As you may have understood by now, many of these parameters that control a learning technique can themselves be learned by special techniques, so this is a kind of second-order learning that is possible. The typical representation of an instance in this case is a feature vector with a number of attributes, or features, plus a label or a number which constitutes the target function value; we are still talking about supervised learning, so we still assume that all our instances are labelled. The training phase is simply the storage of the feature vectors of all training instances in a data structure; nothing more happens at that point. For this kind of approach we need a distance metric, that is, a metric that measures the distance between instances. The default metric is the Euclidean distance, the well-known distance between two elements in a Euclidean space: the square root of the sum of the squared differences between the coordinates of the two elements. Normally you select and define a metric like this one, but, as with the number k, the metric can hypothetically also be learned. kNN can be used both for classification and regression. In the kNN classification case the output is of course a class membership: a query instance is assigned the class label most common among its k nearest neighbors. If k equals one, which is a special case, the instance is simply assigned the class of that single nearest neighbor. One can say, using an analogy, that the k nearest neighbors vote for the choice of the class label when k is larger than one. In kNN regression, the output is the property value for the query instance, and this value is the average of the values of its k nearest neighbors. For practical reasons we will use as example the simplest case, which is binary classification in a feature space of two dimensions. To the right you can see a summary of the stages in this kind of algorithm: you look at the data as points in the space, calculate distances, find the appropriate number of neighbors and, given that you have chosen the neighbors, aggregate the outcomes of the neighbors and take that as the class label in this particular case of a classification task.
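A minimal kNN classifier in plain Python, just to fix the ideas (Euclidean distance, majority vote; all names are my own):

```python
from collections import Counter
from math import dist   # Euclidean distance, Python 3.8+

def knn_classify(training, query, k=3):
    """training: list of (feature_vector, label) pairs."""
    neighbors = sorted(training, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

data = [((1.0, 1.0), "blue"), ((1.2, 0.8), "blue"),
        ((3.0, 3.2), "red"),  ((3.1, 2.9), "red"), ((2.0, 2.0), "red")]
print(knn_classify(data, (1.5, 1.5), k=3))   # -> 'blue'
```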

On this slide you can see a very simple illustration of a classification application for the k-nearest neighbor algorithm. It illustrates the outcome of the algorithm for different values of k, in this case the values 1, 3 and 5. As you can see, for the inner circle k is one, and you only look at the single closest neighbor, which in this case happens to be an instance classified as blue. When you select the number three, you consider three neighbors, and then it happens that two more red instances come into play, so the outcome of the algorithm will be red. If you instead consider five instances, two more blue ones come into play, so the result swings back towards blue, and the outcome of the algorithm will be blue. This shows that the outcome of the algorithm can clearly change depending on how many instances you include in the process. It is obvious that in the case of instance-based learning there is only one explicit space, the instance space. As has been said, a hypothesis space is never explicitly built up. One can say that the hypotheses are implicit in the structure of the instance space, as defined by the chosen metric. One form of illustration of these implicit representations is the so-called Voronoi diagram. A Voronoi diagram is a partitioning of the decision surface into convex, polyhedral surroundings of the training instances. Each polyhedron covers the potential query instances that are closer to its training instance than to any other training instance, so every query point is included in some polyhedron. This kind of diagram is not explicitly used in the processing of the algorithm, but it serves an important purpose in being able to illustrate to the user of the system how the situation looks. We exemplify here all the time with two-dimensional cases, but as you understand it is natural to extend this kind of technique to dimensions larger than two, and you can see at the bottom right an illustration of that case.

One important aspect of instance-based learning is the existence of a distance measure, similarity measure or proximity measure, and you always have to define such a measure over all instances in the instance space. As you can see from the figure to the right, this means that the instance space always has to be a metric space, that is, a space with a metric. The most typical case is that our instances are feature vectors, but it does not necessarily have to be so, because an instance could be anything, and you can also think of other kinds of entities on which you hypothesize a metric; still, the very common case is the feature vector. Therefore I want to talk a little, on this slide, about two kinds of vector spaces that are well known from mathematics, called a normed space and an inner product space. The more general category is a kind of vector space that has something called a norm. A norm is a real-valued function that intuitively corresponds to a length or a distance. A norm satisfies a number of key properties; one of them, number 4 here, is the triangle inequality, which means that the norm of the sum of two vectors is always less than or equal to the sum of the norms of each of them. We have a special notation for the important special case of the Euclidean norm, the traditional norm we know from a Cartesian space, where the norm is the square root of the sum of the squares of the components of the vector. Obviously, a distance here is naturally the norm of one vector minus the other. So this is the normed space, and then we can go further and look at normed spaces where there also exists an operation called the inner product. As you can see down there, an inner product, or dot product (in the Euclidean case they are equivalent), is first of all a scalar value, and the classical definition is that the inner product of two vectors is the product of their norms multiplied by the cosine of the angle between the vectors. As a special case, when the angle is 90 degrees the cosine becomes zero, so the inner product becomes zero as a consequence, and in this case we say that the two vectors are orthogonal; that is the terminology. We need these concepts because when we define distance measures we need both the concept of a norm and, in some cases, the concept of an inner product. You can see in the picture to the right that the inner product spaces are a subset of the normed vector spaces, and the normed vector spaces are a subset of the metric spaces.
Let us turn then to distances and similarity metrics. A distance metric (metric, measure and function are used as synonyms here) is typically a real-valued function that quantifies the distance between two objects. The distance between a point and itself is zero, all other distances are larger than zero, distances are symmetric, and we have the triangle inequality, which informally means that detours cannot shorten the distance: it is always at least as long to go from one point to another via a third point. Distance and similarity metrics have been developed more or less independently: in certain cases it is more practical to talk about the distance between one vector or point and another, while in other cases it is more natural to talk about how similar two things are. Therefore, depending on the context, people have developed distance metrics on one side and similarity metrics on the other, but obviously they are intuitively connected: they are somehow inverses of each other and can be transformed into each other. You will find, and maybe it is a little confusing, that the literature sometimes defines distance metrics and similarity metrics separately, but there can always be a transformation between them. Typically, similarity metrics take values in the range of -1 to 1, so they are always normalized somehow, where 1 means that the objects are regarded as identical and -1 corresponds to the maximum distance considered by the corresponding distance metric; if two objects have a similarity of -1, they are as far apart as is conceivable in that context. Distance metrics, on the other hand, often take values from 0 to infinity, but there are always transformations and normalizations that can make distance and similarity metrics comparable. We will exemplify here using metrics in a normed Euclidean vector space and metrics based on overlapping elements; those are the two cases we will focus on.

Now I want to give you some examples of distance measures. We start with the category of normed and inner product vector spaces; these are classical measures suitable for instance spaces where we have normal feature vectors. First there is the Minkowski distance, which is maybe not so well known to you, but one can say that the Minkowski distance is defined in such a way that it becomes an umbrella concept for three more well-known types of distances. The Minkowski distance is defined as follows: you take the absolute differences between the corresponding components of the two vectors, raise each difference to the power k, sum them, and then take the kth root of that sum. When one looks at different values of k, you can see that if you take k equal to two you get the Euclidean distance: you simply take the sum of the squares of the differences and then the square root of that sum, the classical theorem-of-Pythagoras distance. If you take k equal to one, you get something else that is also pretty well known, called the Manhattan distance. The Manhattan distance is the sum of the absolute differences of the components. It is called the Manhattan distance because if you move from one office to another in Manhattan, the straight-line distance on the ground is not what matters; as the joke goes, if you go from the hundredth floor in one building to the 250th floor in another, the vertical distance is as important as the horizontal one, so the distances along each axis are simply added. Then we have a third case, also rather well known: if we let k go to infinity, the formula degenerates, and in this limit we get the distance defined as the maximum of the differences over the components. This distance is also called the chessboard distance (or Chebyshev distance) because it is equivalent to the minimum number of moves a king needs to go between two squares on a chess board. A special measure I want to mention, which needs not only norms but also the inner product, is the cosine similarity measure. The cosine similarity is equal to the cosine of the angle between the two vectors, which can be calculated as the inner product of the vectors divided by the product of their norms. It is interesting because in some cases when we compare feature vectors, the size, that is the length, of the vectors is not necessarily important, but rather the angle between them. Of course it is a little abstract: if we have one feature vector and another feature vector, what does an angle mean? But there are obviously some domains where it makes sense primarily to look at the difference in angle, and that is exactly what this measure measures.
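In symbols (my reconstruction of the slide's formulas, for two vectors x and z with n components):

```latex
d_k(x, z) = \Big(\sum_{i=1}^{n} |x_i - z_i|^k\Big)^{1/k}
\quad\text{(Minkowski)}

d_2(x, z) = \sqrt{\sum_{i=1}^{n} (x_i - z_i)^2}
\quad\text{(Euclidean, } k = 2\text{)}

d_1(x, z) = \sum_{i=1}^{n} |x_i - z_i|
\quad\text{(Manhattan, } k = 1\text{)}

d_\infty(x, z) = \max_i |x_i - z_i|
\quad\text{(Chebyshev, } k \to \infty\text{)}

\mathrm{sim}_{\cos}(x, z) = \frac{\langle x, z\rangle}{\lVert x\rVert\,\lVert z\rVert}
```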

Obviously the algorithm gets very different properties depending on which similarity metric you apply. If you have a nice system where you can simulate this and easily produce Voronoi diagram visualizations, then you can easily see that when you plug in one metric you get one result and when you change it you get a different one. Here you can see an illustration of how the Voronoi diagram looks when you apply a Euclidean metric to an application, and what you get when you change to the Manhattan distance, for example.

Some of the similarity measures are pretty straightforward and easy to understand, but maybe the use of the cosine similarity is a little more unintuitive, so to give you an illustration of its use I have included an example here which I find pretty neat. It also shows that in many applications you have to map properties of instances in a domain of a rather different form into feature vectors. This is an example of a domain where you want to look at medical or pharmaceutical texts and measure the frequency of certain kinds of terms. The trick here is that you take each text fragment, or page, or whatever is the unit of analysis, count the occurrences of the key terms you want to measure, choose those terms as the features, or dimensions, of your instance, and let the value of each such feature be the frequency of that term in that particular text. This shows how you can take a normal text, where the page is the kind of raw instance data, and transform that text into a feature vector that just keeps the key terms and their frequencies. When you have transformed the instances of the domain into a feature space, you can apply these kinds of methods. And intuitively it makes sense that what matters here is the relation between the frequencies, not the actual numbers; it is more important to see whether one term dominates over another. So intuitively, I hope you can understand that the difference between such vectors is more a matter of the angle between them, because the angle is influenced by the relative sizes in the different dimensions; therefore this is a case where the cosine metric makes sense.
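A small sketch of this idea (the term counts are invented for illustration):

```python
from math import sqrt

def cosine_similarity(x, z):
    dot = sum(a * b for a, b in zip(x, z))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in z)))

# term-frequency vectors over the terms ("insulin", "dosage", "trial")
doc_a = [10, 4, 1]      # short text dominated by "insulin"
doc_b = [50, 21, 4]     # longer text with roughly the same proportions
print(round(cosine_similarity(doc_a, doc_b), 3))   # close to 1 despite different lengths
```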

Metrics for feature vector spaces are important, but we do not necessarily have to be tied to them; sometimes our instances are something else, such as sets, binary strings or texts. Therefore I have included here three examples of similarity measures that work on those kinds of instances. If we look at texts or words, we have something called the Levenshtein distance, which embodies the idea of counting the single-character edits (insertions, deletions or substitutions) that have to be made to one text or word in order to create the other. The Levenshtein distance is thus an example of how you can define a similarity or distance metric between texts. Similarly, if the instances are sets we have something called the Jaccard index, which measures the similarity between finite sets and is defined as the size of the intersection of the two sets divided by the size of their union. The final example is something called the Hamming distance, a classical distance from information theory; it is of course very similar to the Levenshtein distance but is defined for strings of equal length, typically binary strings. The Hamming distance is the number of positions at which the corresponding symbols are different; it measures the minimum number of substitutions required to make the two strings equal.
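Sketches of the three measures in Python (the Levenshtein version is the standard dynamic-programming formulation):

```python
def jaccard(a, b):
    """Similarity between two finite sets."""
    return len(a & b) / len(a | b)

def hamming(s, t):
    """Number of differing positions; requires equal-length strings."""
    assert len(s) == len(t)
    return sum(c1 != c2 for c1, c2 in zip(s, t))

def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions
    and substitutions turning s into t."""
    prev = list(range(len(t) + 1))
    for i, c1 in enumerate(s, 1):
        curr = [i]
        for j, c2 in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (c1 != c2)))   # substitution
        prev = curr
    return prev[-1]

print(jaccard({1, 2, 3}, {2, 3, 4}))     # 0.5
print(hamming("10110", "10011"))         # 2
print(levenshtein("kitten", "sitting"))  # 3
```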

Going back to the k-nearest neighbor algorithm, there are, before we leave it, some issues one has to consider for this kind of algorithm. Theoretically the number k can be anything, but typically one should choose an odd number to avoid tied votes; a very practical comment. Either you define k yourself or you can try to learn it, and on the next slide we exemplify a method, itself a kind of learning method, for how one can infer a suitable value of k through something called the bootstrap method. Another comment is that this kind of majority voting can be problematic when the distribution is skewed, so this method is also dependent on how the instances are distributed in the instance space. If you have a very large number of examples of one class and very few of another, this can also create an issue, because the density of the dominating class can happen to be high in the neighborhood of a new example, which can create anomalies in the analysis. Also, if you have a large feature set for which you have not really done a thorough feature engineering, many of the features can be irrelevant, and then the instance space expressed in these features can have strange balance properties: the relevant instances may be in one corner, the irrelevant in another. This may also cause problems, and relatedly you can actually get a very sparse instance space, in which case the idea of having exactly the same distance metric for every case may have negative effects. So of course there are positive things with k-nearest neighbor, but as you have understood from this slide there are also a number of issues that can occur because the structural properties of the instance space can be skewed in many ways.

On the last slide I mentioned a method to define optimal values for some of these parameters, or hyperparameters, that need to be set for an algorithm like the k-nearest neighbor one. One such method is called the bootstrap method, and it works in such a way that one takes smaller samples out of the original dataset by drawing observations from the large sample one at a time, returning each observation after it has been drawn, because you do not want to disturb the dataset; after this process the dataset has to be as it was. This is called sampling with replacement. The method allows you to use the observations for a pre-study, and then, as if nothing had happened, you can continue in the end with the original analysis. The bootstrap method can be used to estimate a quantity of a population: you repeatedly take small samples, calculate whatever statistic you want (in our case, from the last slide, a prognosis for a good k) and then take the average of the calculated statistics.
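A generic bootstrap sketch in Python (the statistic here is just the sample mean; in the kNN setting it would instead be some estimate of a good k, which I leave abstract):

```python
import random

def bootstrap_estimate(data, statistic, n_samples=200, sample_size=None):
    """Estimate a statistic by averaging it over samples drawn
    with replacement from the original dataset."""
    sample_size = sample_size or len(data)
    estimates = []
    for _ in range(n_samples):
        sample = random.choices(data, k=sample_size)   # sampling with replacement
        estimates.append(statistic(sample))
    return sum(estimates) / n_samples

data = [2.1, 3.5, 2.9, 4.0, 3.3, 2.7]
print(bootstrap_estimate(data, lambda s: sum(s) / len(s)))  # close to the mean of data
```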

To summarize the bootstrap method: you choose a number of samples, you choose the size of each sample, you draw each sample with replacement, you calculate the statistics, and then you take the mean of the calculated statistics. Of course, this is a general method, so you have to define precise measures for calculating the statistics you are interested in, but it is a general scheme for making pre-studies, so to say, of a data set in order to get a good prognosis for the setting of a hyperparameter. For practical reasons, not to make the video too long, we make a break here and continue to talk about instance-based learning in part two. Thank you, bye.


Welcome to the second part of lecture 4.4 on instance-based learning. I will now talk about something called the distance-weighted nearest neighbor algorithm, which is actually a pretty straightforward extension of the k-nearest neighbor classifier. If we consider that the k-nearest neighbor classifier can be viewed as assigning the k nearest neighbors a weight of 1 and all others a weight of 0, the natural extension is to weight the nearest neighbors as well, so that the ith nearest neighbor is assigned a weight wi, for i = 1 to k, where k is the number of neighbors we consider. Normally one has to define a specific function that computes this weight; a common choice for that function is the inverse of the square of the distance between the instance to be classified and the training instance in question, but this function could also be one of those kernel functions that we will describe a little later in this lecture. There are two cases, of course, just as for the k-nearest neighbor algorithm. We have the classification version, where the output is a class membership: the query instance is assigned the class label most common among the k nearest neighbors, but where the vote of each neighbor i is weighted with wi, so it is not an equal vote but a weighted vote. In kNN regression, where the result should be a property value, we calculate this value as the weighted sum of the neighbors' property values divided by the sum of the weights. So it is a pretty straightforward extension of the normal k-nearest neighbor case.
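In formulas (my notation; x_q is the query instance, x_1 to x_k its nearest neighbors with labels or values y_i):

```latex
w_i = \frac{1}{d(x_q, x_i)^2}
\qquad
\hat{y}(x_q) = \frac{\sum_{i=1}^{k} w_i\, y_i}{\sum_{i=1}^{k} w_i}
\quad\text{(regression)}

\hat{c}(x_q) = \arg\max_{c} \sum_{i=1}^{k} w_i\, \mathbb{1}[y_i = c]
\quad\text{(classification)}
```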
It is also possible, when you weight, to consider extending the k nearest neighbors from k to all data items, which means that you would consider all data items in this process. One can say that there are two versions of the weighted nearest neighbor: either you keep to a normal number k, as in the original algorithm, which we call the locally weighted method, or we extend this to all data items, which we call the globally weighted method. The normal weighted nearest neighbor algorithm can handle both classification and regression; however, in both cases the algorithm approximates the target function for one specific query instance, that is, for one point in the space. We have another concept here, locally weighted regression, where we extend this approach to regression from one point to a region around a query instance xq. If we look at the term: it is called local because we do regression not over the whole space but based on data near xq, and of course we still keep the word weighted because the contributions of the training instances are weighted based on their distance from xq. The weights are typically defined by a function and, as you will see, we will call this kind of function a kernel function; generally it is a function where the weight, the contribution, of an instance is higher close to the new instance and lower further away, and one can say that the kernel function moderates the original distance measure, which is why we are looking at it. Regression, of course, is natural because we still aim at approximating a real-valued function, and this function could be of any kind: it could be linear, it could be quadratic, etc. You see in the picture below an illustration of how this looks, where we have a certain point and some kind of weight curve that determines the importance of the surrounding items.
A kernel function is a concept imported from nonparametric statistics. A kernel is a window function, defined over a certain window, and when the argument to the function is a distance measure one can say that the kernel function is a moderation of the original distance measure, meaning that the effect of the distance is moderated. A kernel is normally a non-negative real-valued integrable function, and for most applications it is common to define the function to satisfy some further constraints, for example that the integral of the whole area beneath the function is 1, so the area is always 1, and that the function is symmetric with respect to 0. There are a lot of these functions, exemplified in this graph, and for all of them, apart from the uniform one, we can see the decreasing weight, or importance, for points at distances further away from the center.
I also want to mention another approach to instance-based learning that connects to the kernel functions we have just talked about. Kernel functions are also used in a kind of weighted instance-based learner called kernel methods. The kernel function here serves as a similarity function, and this kind of method typically computes a classification as a weighted sum of similarities. In this approach the class labels are modelled numerically as plus 1 and minus 1, both for the class labels of the training instances and for the output for the unlabelled query instance xq. It is also assumed that there is a weight allocated to each training example, and you have a function K that measures the similarity between any pair of instances; in this kind of approach this function is called the kernel function. So what we do is form a sum of terms, where each term is the product of the weight of a certain training instance, the classification of that instance (plus or minus 1) and the kernel function value between the target instance and that training instance. Then we take the sign of that sum, and the sign of the sum becomes the class label of the target instance. This is a pretty old method; the first instance of it was invented many years ago, in the 1960s, and was termed the kernel perceptron.
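Written out (my notation, with training instances x_i, labels c_i in {+1, -1}, weights w_i and kernel K):

```latex
\hat{c}(x_q) = \operatorname{sign}\!\Big(\sum_{i=1}^{n} w_i\, c_i\, K(x_i, x_q)\Big)
```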

Now we will switch to a few other but related topics. As you may have already understood, machine learning is a pretty complex field: there are many concepts, many approaches, many kinds of algorithms, and there is no self-evident, trivial characterization of all these things. So what I try to show you on this slide is how I want to organize the few concepts in the rest of this lecture. The focus for the rest of the lecture is something called a support vector machine. A support vector machine is a kind of scheme, or type of algorithm, which is an instance of something called a binary linear classifier, and essentially the common theme for all of these is to look at the instance space and try to find lines or surfaces that distinguish between different classes; the word binary of course signifies that we talk about two classes in the basic case. So this is a special kind of algorithm; however, I find it natural to discuss it here because support vector machines typically work in an instance-based learning fashion, in the sense that this kind of algorithm does not build up an explicit hypothesis structure. Rather, it indirectly computes some kind of surface in the feature space which, one can say, corresponds to the hypothesis. Therefore the support vector machine, in my mind, is very much inspired by instance-based learning. The catch, though, is that the support vector machine in its basic form handles only the linear case. If we want this technology to be able to handle nonlinear cases, we need to do something more, and the trick is to map the instance space onto another space, so that in the new space we create it is possible to work with linear methods; the technology for doing that is termed kernel methods. That is the logic of the rest of the lecture.

A few words about binary linear classifiers. Such classifiers are binary in the respect that they distinguish only between two categories or classes. They are linear in the respect that the classification is based on a linear function of the inputs. Each data item is characterized by a vector x of n features; we refer to the ith feature of an instance x as ai(x). Associated with each instance is also a binary-valued class label c(x). The goal of a binary linear classifier is to find a hyperplane, a line or plane or something analogous in higher dimensions, that separates the instances of the two categories. The property of the instance space required for such a hyperplane to exist is called linear separability. Obviously, a linear classifier can only handle linearly separable cases. You can see a very simple example of this to the right: the linear case, where a separating line is easy to find in two dimensions, and a case where the boundary clearly is not linear. Techniques that can handle nonlinear situations we call nonlinear classifier techniques.

This slide simply shows two other examples of hyperplanes. What is important to understand is that for an n-dimensional Euclidean space, a hyperplane is an (n minus 1)-dimensional subset. This means that if we have a two-dimensional space, as to the left, the hyperplane is a line; if we have a three-dimensional space, the hyperplane is a plane, and so on. It is a little more tricky to illustrate how a hyperplane actually looks in higher dimensions. The support vector machine is a model for machine learning which operates in a supervised mode with pre-classified examples; it can also be run in an incremental mode. It is an instance of a non-probabilistic binary linear classifier system, where binary means that it classifies instances into two classes and linear means that the instance space has to be linearly separable. A support vector machine can be used to handle nonlinear problems if the original instance space is transformed into a linearly separable one. It can also manage a flexible representation of the class boundary, it contains mechanisms to handle overfitting, and it has a single global minimum which can be found in polynomial time. So it has many nice properties: it is easy to use, it often has good generalization performance, and the same algorithm solves a variety of problems with very little tuning.

The support vector machine, or SVM, performs binary linear classification by finding the optimal hyperplane that separates the two classes of instances; it is a typical instance of a binary linear classifier. A hyperplane in an n-dimensional Euclidean space is a flat (n minus 1)-dimensional subset of that space that divides the space into two disconnected parts; in a two-dimensional space the hyperplane is a line dividing the plane into two parts, as an example. New instances are mapped into the instance space and predicted to belong to one of the classes based on which side of the hyperplane they are positioned. One should observe here that the hyperplane is not directly or concretely stored as a hypothesis; it is rather computed on demand, which is typical for instance-based systems. The support vector machine in its basic form can only be applied to linearly separable instance spaces, so in that form it has limited coverage.

I want to give you an informal outline of the support vector machine algorithm; it works as follows. A small subset of instances in the borderline region between the main instance space regions of the two classes are chosen as cornerstones for the analysis. These instances are called "support vectors"; as you understand, they have also given the name to the algorithm. The SVM algorithm aims at maximizing the margins around the separating hyperplane we look for, that is, maximizing the distance between the target hyperplane and the chosen support vectors. As you can see in the picture to the right, depending on how we select the hyperplane, or line in this simple case, the margins become more or less broad, and the algorithm aims at finding the maximum margin. One can say that the optimal maximum-margin hyperplane is fully specified by the chosen support vectors. There is thus an optimization problem to be solved, but this problem can be expressed as a quadratic programming problem that can be solved by standard methods.
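In practice one rarely solves the quadratic program by hand; a library such as scikit-learn (assumed here, not part of the lecture) does it for you:

```python
from sklearn.svm import SVC

# toy linearly separable data
X = [[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]]
y = [-1, -1, -1, 1, 1, 1]

clf = SVC(kernel="linear")   # maximum-margin linear classifier
clf.fit(X, y)
print(clf.support_vectors_)  # the instances chosen as support vectors
print(clf.predict([[2, 2]]))
```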

SVMs are very straightforward to apply in linear cases, but we also want to look into how we can apply SVMs in nonlinear cases and in cases of non-feature-vector data. As the SVM only handles linearly separable instance spaces, with instances expressed in fixed-length real-valued feature form, all problems that we want to approach with SVM have to be transformed into such linearly separable spaces; in most cases, when we create a new space in order to achieve the separability we want, that space will have more dimensions than the space we start from. The simplest, most straightforward case is that we still have instances expressed as feature vectors, but the instances are configured so that they are not linearly separable, and this is where we will focus the rest of this discussion. The other problem, which is also important, is that of being able to apply a method designed for handling feature vectors to instance spaces where the instances are expressed in other forms. I already illustrated an example of that in the context of the cosine similarity measure, where we have instances that are texts and one has to map the features of those texts into feature vectors. This goes for other things too: you can see to the right examples of complex molecular structures; if those are the raw instances, they also have to be transformed into a feature vector form in order to use this kind of standard method. A key concept in our approach to transforming a nonlinear feature space into an equivalent linear one is to make the new space higher dimensional. A problem with that is that computations of instance coordinates in the high-dimensional target space are often more complex and costly. Luckily, the SVM algorithm only needs a distance or similarity measure between instance positions in a linearly separable space. This means that if we map the original space onto a new space which is linearly separable, which is of course a requirement, then we can see to it that we define a similarity measure in that new space such that the similarity measure is expressed in terms of the coordinates from the original space, which are easier to compute. To summarize: we define a similarity measure that is applicable and meaningful for the new high-dimensional space; the SVM algorithm can use that similarity measure, but the detailed calculation of the similarity measure is defined by a function called the kernel function, so that it is expressed in terms of coordinates from the original space. Then, of course, everything depends on choosing this function so that this works out, and the clever choice is actually to define the kernel function in terms of the inner product of the mappings of the coordinates from the original space. This scheme is what is called the kernel trick.

Let's look at an example: consider a two-dimensional input space and a three-dimensional
target space, with a feature mapping onto the linearly separable three-dimensional space. To
the right you can see what we are trying to do: we have a two-dimensional space with
instances that are not linearly separable, and we now try to find a mapping that maps the
space into a three-dimensional space where the instances are more clearly linearly separable.
So we choose a mapping that takes an instance (x1, x2) onto three new coordinates: the first
coordinate in the three-dimensional space is the square of x1, the second is the square of
x2, and the third is the square root of two times x1 times x2, that is, the mapping
(x1, x2) -> (x1^2, x2^2, sqrt(2)*x1*x2). If we then look at two vectors x and z, we can of
course consider them in both spaces. We take the similarity function in the three-dimensional
space to be the inner product of the mapped x and z in that space, and when we expand that
through a normal calculation using the definition of the mapping, it turns out, depending of
course on our initial choice of mapping, that it becomes equivalent to the square of the
inner product between the corresponding vectors in two dimensions. This means that we have
fulfilled what was said on an earlier slide: we define a similarity measure in a
higher-dimensional space while still being able to compute these distances in terms of the
coordinates of the original space. This is a very simple example of how this can be achieved.
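
To make this concrete, here is a minimal sketch in Python (not from the lecture; numpy is assumed to be available) that checks numerically that the kernel function (x.z)^2, computed in the original two-dimensional space, equals the inner product of the mapped vectors in the three-dimensional space:

import numpy as np

def phi(x):
    # Explicit mapping from the 2-D input space to the 3-D target space
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def kernel(x, z):
    # Kernel function: the squared inner product in the ORIGINAL 2-D space
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
print(np.dot(phi(x), phi(z)))  # inner product computed in the 3-D space: 16.0
print(kernel(x, z))            # same value, without ever leaving 2-D: 16.0

Both print statements give the same value, which is exactly the point of the kernel trick: the high-dimensional inner product is obtained at the cost of a low-dimensional one.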

Apart from extending the use of support vector machines from the linear case to the
nonlinear case, which we have now discussed extensively, it is also interesting to see how
the technique can be extended to other classification scenarios. One natural case is to see
whether we can extend the use of this technology from binary classification to multi-class
problems, and the strategy here is to reduce the single multi-class problem into multiple
binary classification problems, because to each of these we can then apply the technique
described above; a sketch of this reduction is shown below. It is also possible, though maybe
not quite as straightforward, to modify the algorithm so that it can handle regression,
keeping most of the key properties of the algorithm.
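
As an illustration of the reduction strategy, here is a hedged sketch of a one-vs-rest wrapper; the class name OneVsRestSVM is hypothetical, and scikit-learn's SVC is assumed to be available as the underlying binary classifier:

import numpy as np
from sklearn.svm import SVC

class OneVsRestSVM:
    """Reduce a multi-class problem to one binary SVM per class."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # One binary classifier per class: this class vs. all the others
        self.models_ = [SVC(kernel='linear').fit(X, (y == c).astype(int))
                        for c in self.classes_]
        return self

    def predict(self, X):
        # Pick the class whose binary classifier is most confident
        scores = np.column_stack([m.decision_function(X) for m in self.models_])
        return self.classes_[scores.argmax(axis=1)]

Each class gets its own binary SVM trained to separate it from all the others, and prediction picks the class whose classifier gives the highest decision value.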

So this was the end of lecture 4.4, on instance-based learning. Thanks for your attention.
The next lecture, 4.5, will be on the topic of cluster analysis. Thank you.


Welcome to the fifth lecture of the fourth week of the course in machine learning. In this
lecture we will talk about cluster analysis. The agenda for the lecture is as follows: we
will first talk about cluster analysis in general, say a few words about hyperparameters for
this kind of algorithm, talk about distance measures, and then we will choose one formal
categorization of clustering algorithms into five categories: partitioning-based,
hierarchical-based, density-based, grid-based and model-based. There will be a short
discussion about each of these categories. Cluster analysis is an important element in
unsupervised concept learning, which means learning multiple concepts from unsorted
examples. Apart from being an important methodology for pre-processing of data sets in the
unsupervised machine learning scenario, cluster analysis can be used as a standalone
technique for particular categorization purposes. As instances are not classified in the
unsupervised scenario, algorithms have to identify commonalities and structures in the data
set and group items based on similarity. When these groups have been found, the detailed
concept formation can continue, but then typically by using any of the techniques for
supervised learning described in earlier lectures of this course.
Cluster analysis has many synonyms, like clustering, conceptual clustering, clustering
techniques, clustering methods, etc. Cluster analysis is the task of grouping a set of
objects in such a way that objects in the same group, or cluster, are more similar in some
sense to each other than to those in other groups. Cluster analysis can be achieved by
various algorithms that differ significantly in their understanding of what constitutes a
cluster and how to efficiently find them. There are possibly over 100 published clustering
algorithms. Typically, clustering algorithms depend on several hyperparameter settings, as
do all machine learning techniques. Potentially these parameter settings can also be
automated based on separate learning processes. The default is to preset them, but they can
in principle also be learned.
Let's look at a few examples of hyperparameters that may need to be specified for
clustering algorithms. Of course, one important parameter is the number of clusters; several
algorithms need that to be predefined. Also the number of features used to describe the
instances is important, but that is general to all machine learning algorithms. The type of
distance measure to employ is a crucial choice. For certain methods density is a key
concept, and one approach there is to set up a threshold for the maximum distance between
instances as a criterion for being dense, as well as the number of instances that have to
satisfy such a density threshold; but there can also be alternative density measures. Also,
rather generally for most machine learning algorithms, you many times need to specify how
many iterations or passes over the data set the algorithm should make.
Distance metrics have been described in the lecture on instance-based learning, but they
also play a very crucial role in cluster analysis. A distance metric, measure or function is
typically a real-valued function that quantifies the distance between two objects. As was
also said in the earlier lecture, but is worth repeating here, distance metrics and
similarity metrics have been developed more or less independently for different purposes,
but usually specific similarity measures are intuitively inverses of corresponding distance
metrics and can therefore be transformed into each other. In the earlier lecture we also
exemplified two categories of metrics that are pretty common. Either there are metrics in a
Euclidean vector space, like the Minkowski distance and its specializations, the Manhattan,
Euclidean and Chebyshev distances; one can also use the cosine measure. But we can also have
metrics that work on overlapping elements in other kinds of data representations, not
necessarily feature vectors: the Levenshtein distance for text, the Jaccard similarity for
sets, and the Hamming distance for binary strings. A few of these measures are computed in
the small sketch below.
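
As a small illustration, here is a sketch in Python of a few of these measures computed directly from their definitions (numpy assumed available):

import numpy as np

x, y = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])

manhattan = np.sum(np.abs(x - y))           # Minkowski distance with p = 1
euclidean = np.sqrt(np.sum((x - y) ** 2))   # Minkowski distance with p = 2
chebyshev = np.max(np.abs(x - y))           # Minkowski distance as p -> infinity
cosine_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Overlap-based measures for non-vector data
a, b = "10110", "11100"
hamming = sum(c1 != c2 for c1, c2 in zip(a, b))   # binary strings
s, t = {1, 2, 3}, {2, 3, 4}
jaccard = len(s & t) / len(s | t)                 # sets
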
As there are a hundred or more published clustering algorithms, these algorithms can
themselves be clustered in many ways, and you will find very many ways of doing so in the
literature. For this lecture I have chosen just one of these ways of categorizing the
clustering algorithms, and the categories I have chosen as most natural are
partitioning-based, hierarchical-based, density-based, grid-based and model-based. As you
can understand, all these hundreds of clustering algorithms can then be associated with one
of these categories. Many times there are borderline cases, so depending on perspective one
kind of algorithm could sometimes be viewed as belonging to one category or another, but
more or less I would say that the well-known algorithms, at least those I know, can be
reasonably well classified according to this model. Let's start with partitioning-based
clustering.
Partitioning algorithms are clustering techniques that subdivide the data set into a set of
k clusters, and this number k normally has to be preset for this kind of algorithm. A
majority of partitioning algorithms are based on a selection of prototypical instances, or
synonymously centroid instances. Such algorithms may be termed centroid clustering
techniques, which you can say form a subcategory. In this approach the selection of
centroids is iteratively optimized, and instances are iteratively reallocated to the closest
centroid to ultimately form the resulting clusters. The result can be illustrated as a
partitioning of the data space or, as we did for another purpose in instance-based learning,
as a formal Voronoi diagram; you can see one such diagram here. Regarding the properties of
these algorithms: the target number of clusters k needs to be preset, and of course the
setting of that parameter is very important. Also, the initial seeds for the means or
centroids have a strong impact on the outcome of the algorithm. There is some evidence that
partitioning algorithms may perform better than others, like the hierarchical approaches
that we discuss later, because the clusters they produce have a better fit, or are tighter,
than clusters produced by other methods. There is a series of algorithms of this class, but
the most important and also oldest approach is k-means clustering, which we will discuss in
a little more detail.
Partitioning-based clustering is here exemplified by the approach taken in the k-means
algorithm, which is also an instance of the subcategory centroid-based clustering. The goal
is to partition the instances into k clusters. We start by selecting k instances out of the
instances in the data set and allocate these as the initial means, centroids or prototypes;
the terminology varies, and there is some motivation for each of these terms. The distance
is then calculated from each of these means, or centroids, to every other instance;
typically in the k-means algorithm, at least originally, the Euclidean distance is applied.
In the next step, when that has been done, you associate each instance with the closest mean
according to the calculation. What you get then is a division of the total set into subsets,
and we let these subsets constitute the initial clusters.
You can see this below in the example, with this first partitioning of the data set into
subsets. What happens then is that we refine the means, and now you can understand why we
also use the term centroid: in mathematics there is a way of taking a lot of points in a
space and calculating the most central object based on all the data points. So we take all
the instances that happen to be in one of the initial clusters and calculate, from that set,
their centroid according to the traditional definition. This means that we get a new mean,
or new central point, for each of the initial clusters. Then, for all instances, we
recalculate the distances to the newly computed centroids, which means that in this step
instances can change cluster membership; this is of course a nice feature, because in a way
we can revoke earlier, less optimal choices. All of this is then iterated many times until
we reach stability, in the sense that the centroids do not move between iterations, and then
the algorithm typically stops. A minimal sketch of the procedure is given below.
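
Here is a minimal sketch of the procedure just described, assuming numpy and Euclidean distance; a production version would handle edge cases more carefully:

import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: select k instances from the data set as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: distances from every instance to every centroid (Euclidean)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its current cluster
        # (keeping the old centroid if a cluster happens to become empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Step 4: stop when the centroids no longer move between iterations
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids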

Let's now turn to the second category of clustering techniques: hierarchical-based
clustering, or hierarchical clustering techniques. This is a kind of technique which seeks
to build a hierarchy of clusters, not only one layer of clusters, but rather, as was
mentioned earlier in the course, something like a taxonomy. The results of the hierarchical
clustering process are usually presented in what is called a dendrogram; to the upper right
on the slide you can see a dendrogram that is related to an original set of instances
depicted in the plane, and as you can hopefully infer, the structure of the dendrogram is
based on the proximity, or distance, between the instances in the example.
Regarding the properties of hierarchical clustering: it does not assume a particular value
of k, as needed by k-means clustering, so you do not need to specify that. The generated
tree may correspond to a meaningful taxonomy or concept hierarchy, so from a domain
modelling point of view it is a nice technique. What you need is a distance matrix to
compute the clustering steps; or rather, you need two kinds of matrices: a distance matrix
between instances and what is called a proximity matrix between the clusters, but we will
come back to that in the next slide. Initial seeds have a strong impact on the final
results, as assignments cannot be redone iteratively, so in a way this kind of algorithm is
more sensitive to the local decisions taken during the process. This technique is also very
sensitive to outliers, which is another of its properties.
This slide gives a more detailed example of a dendrogram, the kind of diagram that shows the
hierarchical relationships between objects. A dendrogram is essentially a graphical form of
a taxonomy, and it is very common that hierarchical clustering outputs this kind of
structure; the role of the diagram is of course to work out the best way to allocate objects
to clusters. Let's look a little closer at hierarchical-based clustering. This kind of
clustering can proceed in two ways. In the agglomerative fashion, which is a bottom-up
approach, each observation starts in its own cluster, so in the beginning you have exactly
as many clusters as you have instances, and then pairs of clusters are merged as one moves
up the hierarchy. In the divisive fashion, which is a top-down approach, you start with all
observations in one cluster, and then you split that cluster recursively as you move down
the hierarchy. Splits and merges are typically performed based on a proximity matrix between
clusters, and here it is important to understand that there are two concepts of distance.
First you have the basic distance between instances: for this kind of algorithm you start
with a distance matrix between the instances. But in this kind of approach you want to
abstract a little, so you also want to calculate not only the distances between the
instances but also the proximity, or distance, between clusters. It is actually a very
simple relation, because if you have a distance matrix giving all the distances between the
instances involved, then you can create a proximity matrix between the clusters, the term
used at that level, by taking, for each pair of clusters you want to compare, the average of
the distances between the instances in the two clusters. So the proximity matrix is an
abstraction based on the distance matrix. The proximity matrix has to be recalculated in
each step of the algorithm: the distances do not change, so the distance matrix stays the
same, but because the candidate clusters shift during the process, we need to recalculate
the proximity matrix in every step for each new set of candidate clusters. In general, all
the steps in this algorithm, the merges and splits, are determined in a greedy manner, which
has the effect that you risk not reaching a global optimum; we have discussed this earlier
for all greedy techniques that take local decisions that cannot be revoked or reconsidered.
To the right you can see a simple example where a number of instances are grouped into five
clusters; the proximity matrix is then a matrix with the clusters as rows and columns, and
the calculated proximity values, based on the instance distances, as the elements of that
matrix. Below is a small sketch of agglomerative clustering with this average-linkage
proximity.
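
The following sketch is written directly from the description above (numpy assumed; real implementations, such as scipy's hierarchical clustering routines, are far more efficient):

import numpy as np

def agglomerative(X, target_k=1):
    # Distance matrix between instances; fixed throughout the process
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = [[i] for i in range(len(X))]      # start: one cluster per instance
    merges = []                                  # record of merges (the dendrogram)
    while len(clusters) > target_k:
        # Proximity between two clusters: the average of all pairwise instance
        # distances; recomputed for each new set of candidate clusters
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                prox = D[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or prox < best[0]:
                    best = (prox, a, b)
        prox, a, b = best
        merges.append((list(clusters[a]), list(clusters[b]), prox))
        clusters[a] = clusters[a] + clusters[b]  # greedy merge of the closest pair
        del clusters[b]
    return clusters, merges
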
Now we turn to the third category of clustering techniques: density-based clustering.
Density-based clustering is a clustering technique which groups together instances that are
closely packed together, instances with many nearby neighbors, marking as outliers instances
that lie in low-density regions, whose nearest neighbors are far away. The properties of
this kind of algorithm are that clusters are dense regions in the instance space separated
by regions of lower instance density, and a cluster is defined as a set of connected
instances with maximal density. This kind of algorithm does not need a predefined target
value for the number of clusters, but it needs definitions of distance, reachability and
density. It can discover clusters of arbitrary shape and is pretty insensitive to noise, so
there are many advantages here. We will describe this technique a little more concretely in
terms of the approach taken in one of these algorithms, called DBSCAN.
Let's now go through one of these algorithms, DBSCAN, the density-based approach. The
instances are classified in this case as core instances, reachable instances and outliers.
A core instance has a minimum number of instances within a threshold radius. An instance is
directly density-reachable from a core instance if it is within the threshold radius of that
core instance. An instance is density-connected to another instance if both instances are
density-reachable from a third instance, or if they are directly density-reachable from each
other. All instances not reachable from any other instance are considered outliers and could
possibly be noise. If p is a core instance, then it forms a cluster together with all
instances that are reachable from it. Each cluster contains at least one core instance;
non-core points can be part of a cluster, but they form its edge. All points within the
cluster are mutually density-connected, and if a point is density-reachable from any point
in the cluster, it is part of the cluster as well. So here we can see that we build up the
clusters through the core instances and then through the density-reachability property. The
two important parameters here are of course the threshold radius, and also the minimum
number of instances that must be within the threshold radius in order for an instance to
count as core. In the example to the right you can see that the red instances are core
instances, because the area surrounding each of them within an epsilon radius, the threshold
radius, contains the specified minimum number of points, four in this case. Because they are
all reachable from one another, they form a single cluster. B and C are not core points, but
they are reachable from A and therefore belong to the cluster as well. The point N is
neither a core point nor density-reachable, and it is therefore not considered part of the
cluster. A minimal sketch of this labelling procedure is given below.
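
Here is a minimal sketch of this labelling procedure, assuming numpy; eps is the threshold radius and min_pts the minimum number of instances required for a core instance:

import numpy as np

def dbscan_labels(X, eps=0.5, min_pts=4):
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(D[i] <= eps) for i in range(n)]
    core = np.array([len(nb) >= min_pts for nb in neighbors])

    labels = np.full(n, -1)          # -1 marks outliers (possible noise)
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        # Grow a new cluster from this core instance via density-reachability
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster   # reachable: joins the cluster (edge or core)
                if core[j]:           # only core instances expand the cluster further
                    frontier.extend(neighbors[j])
        cluster += 1
    return labels
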
Finally, we will look a little more closely at something called grid-based clustering. In
grid-based clustering, you quantize the instance space into a finite number of cells,
hyper-rectangles, and perform the required operations on the quantized space. Typical steps
in this kind of algorithm are to find a set of grid cells, assign instances to the grid
cells, and compute the densities of the cells; then eliminate cells that have densities
below a certain threshold, and form clusters from adjacent dense cells based upon some
objective optimization function. So one could say that grid-based clustering is
density-based clustering of a certain kind, with the introduction of this grid structure; a
hedged sketch of the quantization and density steps is given below.
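
A hedged sketch of the quantization and density steps (the final merging of adjacent dense cells into clusters is left out; numpy assumed):

import numpy as np

def dense_cells(X, n_bins=10, density_threshold=3):
    # Quantize each instance into a grid cell (a hyper-rectangle)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    cells = np.floor((X - mins) / (maxs - mins + 1e-12) * n_bins).astype(int)
    cells = np.minimum(cells, n_bins - 1)        # instances on the upper edge
    # Compute the density of each occupied cell and keep only the dense ones
    keys, counts = np.unique(cells, axis=0, return_counts=True)
    return {tuple(k) for k, c in zip(keys, counts) if c >= density_threshold}
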
Finally, I want to say something about model-based clustering, even though it does not
really fit into this week, because as you know the title of the whole week was inductive
learning in the presence of weak or absent domain theories. Model-based clustering is a
clustering technique where we have some model, background knowledge or theory about the
domain from which the instances of the data set are harvested. This model can be more or
less extensive, but will in all cases to some extent guide the clustering process, and the
basic clustering process that we start from can in principle be an extension of any of the
other clustering approaches we discussed earlier here. If the domain knowledge is
statistical information about the distributions of the kinds of instances involved, one can
call this kind of clustering technique distribution-based clustering, as a category of its
own. An example of a distribution-based clustering scenario is one where the sampled
instances arise from a distribution that is a mixture of two or more components, where two
components of course means two types of instances. Each type of instance is described by a
density function and has an associated probability, or weight, in the mixture; the mixture
then describes the whole heterogeneous data set with these kinds of instances. In principle
we can adopt any probability model for the components, but typically we will assume that
they are p-variate normal distributions. What we end up with here is that each component in
the mixture becomes what we call a cluster. So this was the end of this lecture on cluster
analysis; thanks for your attention. The next lecture, the last one for this week, will be
the tutorial regarding the assignments for the week. Thank you, bye.


Welcome to the last video lecture for week four; as always, this lecture is some preparation
for the assignments of the week. As always, we group the assignments into two groups:
questions based more directly on the video lectures, and then questions that are, at least
to some extent, based also on extra material. If we turn to group one first, you can see
here that these questions or assignments follow pretty closely the structure of the lectures
for the week, so there is at least one question on each of the lecture themes, but we stick
to ten questions here altogether, and we have decided to have two questions on the
instance-based learning theme, which is maybe the densest lecture with respect to content
this week. As you have probably understood, my principle here is to highlight for you,
through the assignments, some of the important points of the week. For generalization as
search, the question is how generalization takes place, in this case in a depth-first
search; the second question focuses on how entropy is measured, as one of the important
measures used to guide the build-up of a decision tree. For instance-based learning there is
one question related to the distance measures used in many of the instance-based learning
algorithms, and there is also one question related to the kernel-related methods. Finally,
for clustering there is one question focused on hierarchical clustering techniques.
Concerning the recommended extra readings, you have by now discovered that I systematically
try to recommend, in most cases, some original articles, key articles for each of the
themes. What I want to say here is that these key articles are of course original, since
they are of core importance for the development of each methodology; in some cases they are
also very readable, in some cases not so readable, as I guess you will discover. Even if
some of them are not so readable, you shouldn't feel that you have to read everything in
this extra recommended reading to satisfy the requirements of this course. Still, I will
stick to trying to give you the original articles: it is so easy in this area of machine
learning to just search the net for various documents, and you can judge for yourself
whether they are readable; my contribution here is mainly to point at the original work,
which may not always be so easy to find yourself. So for this week I recommend one article
by Tom Mitchell, an early key article on generalization as search. For decision tree
algorithms I recommend an article by Ross Quinlan, one of the key persons in this
development, called Induction of Decision Trees. For instance-based learning you get an
article by Hart, a very early article which was one of the starting points for this
technique, on nearest neighbor pattern classification. For that theme you also get a second
article, by another well-known researcher, Vladimir Vapnik, on support vector networks. And
finally you get one article on clustering techniques, an early article from 1967 by
MacQueen, regarding some methods for classification and analysis of multivariate
observations. Let us finally turn to the questions in group two; as you see, they are also
spread across the four themes, with two questions on instance-based learning, and hopefully
these questions will also highlight for you some of the key points in the lectures you have
followed this week. So this was the end of this short tutorial on the assignments for week
four; thanks for your attention. The next week of the course will have the following theme:
theory-based learning based on symbolic representations.


Welcome to the first lecture of this fifth week of the course in machine learning. The theme
of this week is called machine learning enabled by prior theories, and the purpose of this
first lecture is to explain the background for the choice of theme and give you an overview
of what will happen this week. When studying machine learning, it is easy to get the
impression that machine learning is typically learning from a lot of examples in a kind of
vacuum, where from these examples the goal is to induce some abstractions or hypotheses with
very little or minimal guiding information. Of course, a lot of the techniques that we have
covered in the course so far are of this character, but there is also a lot of work going on
where you look at learning algorithms that rather learn on the border of existing knowledge,
which means that you actually start with a more or less substantial theory and augment or
debug this theory. I would also say that, as time goes by and the area of machine learning
gets more mature, we will use all the experience gained by developing these more stand-alone
techniques that start only from examples, but it will be more typical in the future that
these techniques are applied in settings where we have a lot of surrounding, already
existing domain knowledge. So for this week I have chosen a number of sub-themes to discuss,
which are examples of this situation where we have a prior theory and where learning in
various fashions improves that domain theory. I have sorted these sub-themes, six of them,
into three groups. The first group of sub-themes has to do with situations where we look at
mixed inductive, deductive and abductive scenarios, and essentially this kind of work stems
out of the computer science logic tradition: explanation-based learning and inductive logic
programming, and I will come back to those. The next part, which is just one theme, is
reinforcement learning, and I will spend considerable time on that: first of all it fits the
theme of the week, because here you already have a system which is designed but you want to
improve that system, and there has also been considerable success in this area lately, for
example in game play. After that there are three other techniques that I will mention more
briefly. I will talk about case-based reasoning, and there will be a separate lecture for
that, but at the end of this initial lecture I will also talk about two other sub-themes:
learning in Bayesian belief networks, and something called model-based clustering, which was
already touched upon last week in the lecture on clustering.

Starting with the first block of sub-themes, I will now say something about inference
techniques. There are three classical kinds of inference techniques: deduction, induction
and abduction. Deduction is the classical form of inference in logic: you derive a
conclusion from given axioms, which constitute domain knowledge, and facts, which are
typically observations for a specific case. The conclusion can be derived by applying
inference schemes working from the axioms and the facts, using forms of inference like
natural deduction, for example modus ponens as a classical case, or resolution. Then on the
other hand we have induction, which is the focus of this course: in induction you derive an
axiom, a general rule, from observations, typically many observations, and possibly some
background knowledge that can guide the generalization process. As you have hopefully
understood, induction is the main inference mechanism for learning. But then we have a third
classical inference technique called abduction, which means that from a known axiom theory
and some observation you can derive a premise, actually by using a rule backwards: you start
from the conclusion and you make an argument about the premise. Abduction is a core element
of all kinds of explanation; as a great example you can look at diagnosis in medicine.
Abduction also has a strong relation to causation, as it looks at causal relations and uses
them backwards. So let us look at the two techniques we are going to study now.
Explanation-based learning enables a spectrum of inferences from abduction to deduction; it
is very much abduction, but there are also cases where abduction in the presence of complete
knowledge turns into deduction. In inductive logic programming, on the other hand, we start
from a full deductive framework of logic programming based on resolution as the inference
technique, and what has then been done to expand or generalize logic programming into
inductive logic programming is to see to it that algorithms are developed that also enable
induction to be made in a convenient way, so that you can easily combine both deduction and
induction in the same framework. Explanation-based learning, abbreviated EBL, creates
general problem-solving schemata by observing and analyzing solutions to specific problems.
EBL spans the spectrum from deduction to abduction; it uses prior knowledge, a domain
theory, to explain why a training example is an instance of a concept. The explanations then
identify which features, in this case predicates, are relevant to the target concept. Prior
knowledge is used to reduce the hypothesis space and focus the learner on hypotheses that
are consistent with that knowledge. Accurate learning is possible even in the presence of
very few training instances. There is an obvious trade-off between the need to collect many
examples to be able to induce without prior knowledge, and the ability to explain single
examples in the presence of a strong domain theory.

Let's go to the second theme of this week, inductive logic programming, abbreviated ILP. The
goal of ILP is to find general hypotheses, expressed as logic programming clauses, from a
set of positive and negative examples also expressed in the same formalism, all this
possibly in the presence of a domain theory likewise described in the same notation.
Regarding the advantages of ILP: as I already said, I think one of the main advantages is
the possibility to express all the crucial items for learning, examples, hypotheses and the
domain or background theory, in the same formalism. It is also an advantage that you can in
a very convenient way handle deduction and induction within the same framework. It is also
the case that, because of the structure of logic programming, where the big strength is that
it is easy to handle multiple relations and structured data, inductive logic programming
inherits this capability from logic programming, which can be beneficial in many areas; one
particular example is applications in chemistry, where there are many complex structured
instances to handle. There is also an opportunity in ILP, though not unique to ILP, namely
the possibility of inventing new predicates and adding them to the domain theory. The
scenario, more formally, for inductive logic programming is that you start from a set of
positive and negative training examples, expressed in an observation language. We have a
domain theory, we have a hypothesis language that limits the kinds of clauses we are allowed
to express our hypotheses in, and we have some kind of relation, covers, which works as a
detector of whether a particular example is covered by a hypothesis, also taking into
consideration the available background theory. The goal, given all that, is to find a
hypothesis that covers all positive examples and no negative examples; a toy sketch of this
goal condition is given below.
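
As a toy illustration of this goal condition, here is a sketch where hypotheses, examples and the background theory are all represented as sets of ground literals, and the cover relation is a simple subset test; this is of course a drastic simplification of real ILP systems:

def covers(hypothesis, example, background):
    # Toy cover relation: all literals of the hypothesis body must hold
    # given the example description plus the background theory
    return hypothesis <= (example | background)

def consistent(hypothesis, positives, negatives, background):
    # The ILP goal: cover all positive examples and no negative examples
    return (all(covers(hypothesis, e, background) for e in positives)
            and not any(covers(hypothesis, e, background) for e in negatives))
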
The next sub-theme for the week is reinforcement learning. Now we turn to a very different
area, and it is important to say at this point that, in contrast to the earlier two
sub-areas, which stem from the computer science logic area, we now go into a methodology
that is very much grounded in and inspired by long-term work in control theory. The
intuitive scenario for reinforcement learning is that you have an agent that learns, from
interaction with an environment, to achieve some long-term goals related to the state of the
environment, including of course the agent itself, and it learns by performing a sequence of
actions and receiving some feedback at every step. From the agent's point of view, we call
the mapping from a particular state to the actions that can be taken a policy; at every
moment the agent has a policy by which it chooses actions relative to the state it is in.
Also with respect to the terminology, the satisfaction of goals is defined by something
called rewards: after each single action, the environment provides a reward, or reward
signal, which constitutes the feedback on the appropriateness of that action. Then we have
another very important concept in this whole area, which we call the return; this is not the
feedback on a single action, but the accumulation of all the rewards for a whole episode,
which is the term for a sequence of actions from a start state to what is considered a
terminal state. The goal of most of the work in the area is to establish a policy that
maximizes the return for every possible way of going from a state to a terminal state; a
small sketch of the return computation is given below. To the right you can see a simple
depiction of the general framework, and you can also see a practical example of a maze where
a small robot is supposed to find its way, assuming that the robot embodies a variant of a
reinforcement learning algorithm.
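
As a tiny illustration of the return concept, here is a sketch computing the discounted return of one episode; the discount factor gamma is an assumption of the sketch, not something introduced so far in the lecture:

def discounted_return(rewards, gamma=0.9):
    # G = r1 + gamma*r2 + gamma^2*r3 + ..., accumulated backwards over the episode
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([0.0, 0.0, 1.0]))  # 0 + 0.9*0 + 0.81*1 = 0.81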

The next sub-theme of the week is called case-based reasoning, CBR, which is the process of
solving new problems based on the experiences from solutions to similar past problems.
Expressed in an alternative fashion, CBR solves a new problem by remembering previous
similar problems and by reusing knowledge of successful problem solving for those problems.
Case-based reasoning can be motivated by assumptions such as that similar problems normally
have similar solutions, and that many domains are regular in the sense that successful
problem-solving schemes are invariant over time. So one can say that there are some
assumptions of invariance here as a starting point for using this method. Case-based
reasoning is typically contrasted with rule-based reasoning: in a rule-based system you
solve a problem by a fixed, or maybe dynamic, set of rules, while in case-based reasoning
everything starts from the cases, and in many systems there is no explicit rule base, only a
memory of cases. An interesting analogy always comes to mind if you look at the legal
systems of different countries: in what is many times called the German or Central European
tradition, law is very much based on rules, while in the Anglo-Saxon tradition, law is very
much based on cases. Typically, in case-based reasoning, cases are stored in a case space,
or case memory, to be retrieved and used. When a successful solution to a new problem is
found, the adapted case can be stored in the case space to increase competence. It is
actually this incremental growth of the case base that constitutes the learning behavior of
this kind of system. Technically, case-based reasoning is primarily supported by the
techniques described in the earlier lecture on similarity-based learning, also called
memory-based learning, and particularly useful to inherit from that technology are the
schemes for distance and similarity measures, which are crucial for the retrieval of the
similar cases we want to base our reasoning on.

Now we come to the last two sub-themes for this week, which are the sub-themes we will spend
very little time on; I will only show you a couple of slides and comment on them. In the
reading recommendations for this week you will also get some material on these topics, so
you may voluntarily look more into them yourself, but there is actually no time this week to
go further into them. The first of these two is learning in Bayesian belief networks. We
talked about those some weeks ago, and I will shortly recapitulate. A Bayesian belief
network is a probabilistic graphical model that represents a set of variables and their
conditional dependencies, describing effects in terms of causes. Structurally, a BBN is a
directed acyclic graph, or DAG; you can see a small one to the right. In this kind of
network, inference typically aims to update beliefs concerning causes in the light of new
evidence. The network is designed or built up in some direction, so one can say that in the
forward direction you could have very typical deductive reasoning, but the more important
part, what you typically want, is to make inferences about causes in the light of new
observations, new evidence. The major theorem that supports the backward reasoning part here
is Bayes' theorem, which we already discussed in an earlier week. This theorem makes it
possible to make valid probabilistic inferences about whether a cause may hold in the light
of some evidence. Of course, the Bayesian rule governs this inference for one step in the
network, but the same procedure can be recursively applied throughout the whole structure.
But then the issue is what learning means here. One way of handling a Bayesian belief
network is to statically design it, so that we have a fixed structure: you set up a number
of variables, you have a node for each variable, you define exactly the dependencies in
terms of the arcs, and you also fully define the conditional probabilities that guide the
kind of micro-reasoning in each node. But you can also have a situation where you start with
a rudimentary structure. For example, you can have a case where you have defined the
variables and you have defined the connections, so the structure is clear; however, the
conditional probabilities are not known. This is what is called parameter learning, which
means that we have a learning situation where we build up the conditional probabilities,
given a fixed structure, from actual observations of all the pairs or combinations of the
variables involved. For that purpose we can of course use different kinds of specific
learning techniques that we have already looked at. The more ambitious task is to also
handle a situation where not even the set of variables or the structure is known; this we
call structure learning, and there are two cases: we can learn the structure, which means
that the variables are known but the connections are not, and finally we can have a case
where we also learn new variables. Next week you will see an interesting parallel here, when
we talk about neural networks, where it is also an issue whether it is possible to learn new
structures in the same fashion. So this is, very briefly, what the goals of learning are in
this context, and as I said you will get references to some material in this field; if you
are interested, feel free to delve into it. A tiny sketch of parameter learning by counting
is given below.
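
As a tiny sketch of parameter learning with a fixed structure, here is how one conditional probability table entry can be estimated by simple counting; the observation pairs are hypothetical data invented for the sketch:

from collections import Counter

# Observations of the pair (cause, effect) as booleans; hypothetical data
data = [(True, True), (True, True), (True, False),
        (False, True), (False, False), (False, False)]

counts = Counter(data)

def p_effect_given_cause(cause):
    # Maximum-likelihood estimate of the CPT entry P(effect=True | cause)
    total = counts[(cause, True)] + counts[(cause, False)]
    return counts[(cause, True)] / total

print(p_effect_given_cause(True))   # 2/3
print(p_effect_given_cause(False))  # 1/3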

Finally, I want to say something about model-based clustering, which I already mentioned,
though pretty briefly, last week in one of the lectures when we talked about clustering.
Model-based clustering means that clustering is based on some model or background knowledge,
and for clustering this knowledge is typically statistical: it is knowledge about the domain
from which the data set is harvested. In the basic kinds of clustering techniques that we
looked at last week, we build up the cluster structure just from instances, with very little
or non-existent background knowledge. Because the background knowledge in the clustering
case is most of the time of a statistical nature, one also calls this kind of technique
distribution-based or statistics-based clustering. The model, or the knowledge available,
can be more or less extensive, but will in all cases guide the clustering process to some
extent, in contrast to the non-informed variants. It is important to say, mainly because it
can be a little confusing given that we introduced categories of clustering techniques last
week, that for model-based clustering you can essentially start from any of those categories
of clustering techniques and augment it by adding domain knowledge as a guiding principle.
The most common case in this area, I would say, is that the knowledge we have consists of
statistical distributions regarding the objects that we look at, essentially statistical
distributions for the different kinds of objects, which in a way indirectly refer to the
potential clusters we want to discover. Examples of this kind of method are Gaussian mixture
models, where we have a fixed number of Gaussian distributions that are initialized randomly
in the beginning but whose parameters are iteratively optimized to better fit the data set;
these Gaussian distributions then relate to the different hypothesized groupings of the data
set. A small hedged sketch of such a fit is given below.
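
A small hedged sketch of a Gaussian mixture fit, assuming scikit-learn is available and using synthetic data drawn from two hypothetical components:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two hypothetical component distributions with different means
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(5.0, 1.0, size=(100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)            # each mixture component becomes a cluster
print(gmm.weights_, gmm.means_)    # the iteratively optimized parameters
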
We also have some clustering techniques based on Bayesian statistics, including a very
well-known system, used very much in practice in industry, called AUTOCLASS, and finally we
have something called conceptual clustering techniques, with one very well-known system
called COBWEB. By this I end the initial lecture; thanks for your attention. We will now go
on to the different subtopics, and the next lecture, 5.2, will be on the topic of
explanation-based learning. Thank you.


Welcome to the second lecture of the fifth week of the course in machine learning. As you
know, this week we focus on machine learning techniques enabled by prior theories, and the
topic of this particular lecture is explanation-based learning. Explanation-based learning,
abbreviated EBL, is a machine learning technique where we try to learn general
problem-solving schemata by observing and analyzing solutions to specific problems. We are
interested in how this can be done by a machine, but it should be obvious to you that this
kind of technique, on a conceptual level, is also very relevant for the way we learn as
humans in everyday life; this is illustrated in the rather famous cartoon to the right,
where three guys look at their friend, see what he does, and probably at a later stage will
abstract from that. Explanation-based learning creates general problem-solving schemata by
observing and analyzing solutions to specific problems, and EBL spans the spectrum of
inferences from deduction to abduction; however, I would say that what is particularly
interesting about explanation-based learning is its ability to formalize abduction. In this
process one uses prior knowledge to explain why a training example is an instance of a
concept, and the generated explanation, as one thing, identifies which features are relevant
to the target concept among the potentially many features describing the problem situation.
Prior knowledge is used to reduce the hypothesis space and focus the learner on hypotheses
that are consistent with that background knowledge, and typically accurate learning is
possible even if we have only very few instances; even a single instance can be enough.
There is obviously a trade-off here between the need to collect many examples in order to be
able to induce without prior knowledge, which is what we have typically done for some weeks
now, and the ability to explain single examples in the presence of a strong domain theory.
Let us try to describe the exact scenario for explanation-based learning algorithms.
Given as input is a target concept defined by some predicate statement. Also given as input
to this kind of algorithm are training examples, typically a few, and a domain theory built
up from facts and inference rules that express background knowledge potentially relevant for
the target concept. In most cases you probably have more background knowledge than you will
actually use, but of course it is not optimal to have too much background knowledge either,
because then you have the problem of choosing the right subset. Finally, there should be an
operationality criterion which specifies which predicates may occur in the learned concept
definition; typically these predicates are those used to describe the training instances. We
do not want to define hypotheses that are too general, in the sense that the definitions of
those concepts are expressed in terms that are not operational in a domain-specific context.
So the goal can be rephrased as finding an alternative, operational, more efficient
definition of the target concept that is still consistent with and logically entailed by
both the domain theory and the training examples. Typically we have some definition of the
target concept, but a rather abstract one, and the idea is to specialize it. Typically, this
kind of algorithm has the following steps. First there is a step where we explain, by which
we mean that we build up some kind of inference chain, or proof, of why this example
actually should be a member of the target concept we are interested in. Then, when we look
at that explanation, that chain of inferences, it is in the beginning pretty concrete, in
terms of the example, and the idea is to generalize the explanation in order to define the
most general rule that is still operational, in the sense that we express ourselves in the
kinds of terms or predicates that we have defined as being on the operational level.
Finally, when we have created this new definition of the concept, we can add it to the
domain theory. So the idea is to incrementally augment the domain theory for each round in
this iterative EBL process.

I want to say a few words about the different perspectives one can have on what EBL actually
means. One perspective, which is the one closest to the mainstream inductive learning
techniques, is EBL as theory-guided generalization of examples. Essentially we have an
induction process, but we guide that process and focus the search for the appropriate
hypothesis by using the theory. As part of that, and this is always important, the
explanations we build should be as simple as possible, so we should use, for example, as few
features as possible while still being able to explain why the example is an instance of
the concept. So that is the perspective closest to the inductive machine learning strand.
Another perspective is EBL as example-guided reformulation of theories: here we rather start
from a theory, and it can be the case that we have a theory that is over-general and not
considered operational, in the sense of being effective to use for problem solving in a
domain. What we want to do is to specialize the theory so that it becomes more operational,
and we can do that by looking at the examples, which can guide us as to which
specializations we should make.

The third way of looking at EBL, where we move step by step in this little list of
perspectives into the deductive realm, is to view EBL as knowledge compilation. Here the
domain knowledge is complete enough to perform the reasoning for specific cases, but what we
can do through EBL is to simplify the inference chains, so that they become more efficient
in particular cases. One can of course question whether that is learning or not, because the
competence to solve the problem is already there; all we do is learn problem-solving skills
to become more efficient. On this slide you can see a graphical depiction of what I have
hopefully expressed earlier: what we really do in EBL is try to flatten the structure, the
chain of inferences, so that instead of expressing the problem solving through a rather
complex chain of inferences, we can express ourselves in a flatter structure, which is often
more efficient. On this slide you can also see a little picture whose purpose is to give you
a feeling for the components of an EBL system. Central is that you have some kind of problem
solver and you specify a case; then you have a knowledge base, which is actually background
knowledge plus some rudimentary hypotheses relevant for the case. There is an explanatory
process where we try to solve the actual problem using the knowledge base, and that
generates an explanation, some kind of inference chain. That inference chain is then
generalized, and when it has been generalized and we have a new, more general concept
definition, we add that concept definition to the knowledge base, we get a new problem, and
so on and so on.

At this point we are going to become a little more formal and look specifically at the
generalization part of the explanation-based learning algorithm. As you remember from one of
the earlier slides, where I showed you a hierarchical structure and a flat structure:
roughly, you can say that a typical explanation that we start from, in its concrete form, is
a hierarchical structure with a lot of inference depth, based on specific rule applications.
What we really want to do is to create as flat a structure as possible from that
hierarchical structure. However, we of course want the new structure to still correspond to
a valid proof, the valid proof we started from. This means that when we transform the
hierarchical structure into the flat structure, we must also see to it that there are
appropriate connections among all variables and constants occurring in the original concrete
explanation or proof. When the word literal is used on this slide, a literal is simply a
common term for a predicate expression whose arguments can be constants or variables. So
what we need is machinery that does this advanced pattern matching, which is called
unification, and which ensures the appropriate binding of variables to constants, or of
variables to variables. That is an important ingredient in this. Essentially, if we look a
little informally at the algorithm described on this slide, you should always keep in mind
that we start from a hierarchical rule-application proof, and what we want in the end is a
flat structure. You can see on the last row a variable called P, which is actually a list of
predicates; these predicates, or literals, are collected during the process, and the end
result is that we set our target concept equal to this flat list, this conjunction of
predicates. So what really happens in this algorithm is that we iteratively build up this
list, and we do that through a process where we take the target predicate and, as it is
expressed here, regress it through each rule on each level of the proof structure, always
involving the unification algorithm, which sees to it that the variables and constants are
unified in the appropriate way. This is, intuitively, what is described here, and I hope you
will look at it in a little more detail; I will also show you an
example on the coming slides. Here we will look at a simple example: a problem where we look
at boxes that can be put on a table or on top of each other. The target predicate we want to
look at in this example is whether it is safe to put one box on another. There are a number
of predicates that say whether the boxes are light or not, where light is defined in terms
of their weight, and weight in turn can be defined in terms of their volume; we can also say
that one of the elements we have actually constitutes the table, and so on. So we have a
little domain theory that describes these kinds of boxes and their properties, including the
table. Then we also have, as is typical for this kind of method, a list of operational
predicates, which means those predicates we are allowed to use in our final definition of
the target concept. Then we have a training example, which is a situation with two objects
and the properties of those objects, and finally we have the target or goal concept, which
in this case is the predicate expressing whether it is safe to put one object on the other.
So this is our domain theory for this case, and now comes the concrete explanation, which is
the chain of inferences through which we motivate why this particular configuration of two
objects, with these particular features, is such that they can safely be put on each other.
As you can see, this tree-shaped hierarchical inference structure uses a certain number of
the rules from our little domain theory. What we do now is go through the hierarchical
structure, and we start from the top with the goal predicate and begin what was termed, in
the description of the algorithm, regressing that predicate through the proof structure,
taking it level by level. So in the first step we look at the connection between the
SafeToStack and the Lighter predicates. We take a step in that direction, find an
appropriate unifier, and start to build up our P variable, which is the list of predicates
that in the end should form our final result; at this point it has only one element.

Here you can see step two: we take the second level, where Lighter is reduced into simpler
predicates, and we do the same thing; we take that predicate and regress it through this
segment of the proof structure with an appropriate unifier, and you can see how the P
variable is built up with a consistent set of variable bindings, as secured by the
unification algorithm. You can then see the third step in the generalization process, which
goes in exactly the same fashion, and the P variable continues to accumulate the relevant
predicates with the appropriate variable bindings. Then there is a last step, and at the
bottom you can see the final P variable. Now the regression process is over, and we only
have to set the target predicate equal to this conjunction of predicates collected during
the process; a minimal sketch of the unification machinery used in this regression is given
below.
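
To give a feeling for the unification machinery used in this regression, here is a minimal sketch, assuming literals are represented as tuples and variables as strings starting with '?'; it omits the occurs check and the full level-by-level regression loop, and the concrete object names are hypothetical:

def is_var(t):
    return isinstance(t, str) and t.startswith('?')

def walk(t, subst):
    # Follow chains of variable bindings
    while is_var(t) and t in subst:
        t = subst[t]
    return t

def unify(a, b, subst=None):
    # Return a substitution that makes a and b equal, or None on failure
    if subst is None:
        subst = {}
    a, b = walk(a, subst), walk(b, subst)
    if a == b:
        return subst
    if is_var(a):
        return {**subst, a: b}
    if is_var(b):
        return {**subst, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):
            subst = unify(x, y, subst)
            if subst is None:
                return None
        return subst
    return None

def substitute(t, subst):
    if is_var(t):
        return walk(t, subst)
    if isinstance(t, tuple):
        return tuple(substitute(x, subst) for x in t)
    return t

# One regression step: the rule SafeToStack(?x, ?y) <- Lighter(?x, ?y)
rule_head = ('SafeToStack', '?x', '?y')
rule_body = [('Lighter', '?x', '?y')]
goal = ('SafeToStack', 'Obj1', 'Obj2')      # hypothetical concrete goal

s = unify(goal, rule_head)                   # {'?x': 'Obj1', '?y': 'Obj2'}
P = [substitute(lit, s) for lit in rule_body]
print(P)                                     # [('Lighter', 'Obj1', 'Obj2')]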

EBL is of course very much dependent on the kind of background theories we work with, and on
the quality of those theories. For the kind of toy example we looked at, we have a theory
that is complete, even if it is small, and in one little corner of it we have one target
concept which we want to operationalize, so it is very well defined. In general, however,
domain theories can have all kinds of imperfections: they can be incomplete, they can be
inconsistent, and they can be incorrect, where incorrect means that the theory is internally
consistent but does not match reality, so it is wrong in the sense that it is not a good
model of the domain we want to model. A theory can also be intractable, because it is too
computationally complex to handle. The reasoning goes two ways here: EBL techniques can of
course be used to make theories better, to complete what is incomplete and so on, but other
kinds of imperfections of theories can also disable the functioning of the EBL algorithms.
So the message here is that EBL is not a general technique that can be used in a
simple-minded way for all possible domain theories. In order to be useful, the use of EBL
techniques must be prepared for by a careful analysis of the available domain theories and
their properties, and some prognosis of the kinds of imperfections they have, and then by
making some clever choices about which kinds of imperfections could be fixed using these
techniques, while also making sure that there are no other imperfections that can hinder or
complicate the use of the techniques.

Another issue that I want to discuss a little is the utility of the created new operational concept definitions or rules. EBL in many cases does not create entirely new knowledge; rather the goal is to improve the domain theory so that problem solving becomes more efficient. So one way of looking at EBL is as knowledge compilation. Of course one can always discuss whether this is learning, because no new knowledge is created, the system just becomes more efficient; on the other hand, if we look at ourselves, a lot of the time when we say that we learn something, what we actually mean is that we learn to become more efficient in doing things. So it could be a matter of debate where we are on the borderline between planning and learning. But in this perspective EBL represents a dynamic form of knowledge compilation, where a system is tuned to incrementally improve efficiency on a particular distribution of problems. It is not necessarily so that the system becomes more efficient for all problems, because we started with a more general theory and then, by handling a series of cases, we in a way optimize the system for that particular sample of instances. It is also the case that if we learn many new operational rules, which we may call macro operators, macro rules or search control rules (the terminology is pretty diverse here), it can unfortunately happen that adding more rules to the system, in spite of our good purpose, will deteriorate the problem-solving efficiency, so that the cost of the generated additional rules outweighs the benefit that we were after. But as always we can come up with countermeasures: we can be more careful in our selection of rules to store, and we can also see to it that we sometimes throw away rules that are rarely used. So by making our algorithm more complicated we can still fight this phenomenon that a growing rule set deteriorates performance.

EBL systems are not a new invention; this kind of system has been built for a very long time, and you will find on this slide a list of examples. If you look at the earlier systems, one famous combination called STRIPS+MACROPS is a very famous planning system from the early 1970s, which on top of its planning performance also had a capability to build what was called at that point macro operators. Also the HACKER system by Sussman at MIT was from that time, and if we look at what really happened in both of these systems, it was very similar to what we now call explanation-based learning. So one can say that the evolution of the term explanation-based learning really started more or less in the mid 1980s, when there were a number of papers, one of them mentioned here by Mitchell, Keller and colleagues, where they tried to create an abstraction for this kind of system. You will also get some references to some of these systems in the recommended readings. This was the end of the lecture on explanation-based learning. Thanks for your attention; we will now turn to the next subtopic, inductive logic programming, which will come up in the next lecture. Thank you.


Welcome to the third lecture of the fifth week of the course in machine learning. This lecture will be about Inductive Logic Programming. Inductive logic programming, abbreviated ILP, is an attempt to marry the area of logic programming with techniques from machine learning, establishing what one can call learning with logic. One of the big advantages of this marriage is that it creates a framework in which you can express both deductive and inductive inferences in a very uniform way. So let us look at the two main forms of inference. First deduction, where we apply rules to facts and can then infer other facts: this is the traditional way, and this kind of inference is a cornerstone of what you do in logic programming, so we do not have to worry about that. The new thing with inductive logic programming is to see to it that we can also make inferences in the opposite direction: we can start with just a collection of facts and from those facts infer some rules, given of course, as you hopefully have understood, that this will only have some credibility if we handle a reasonably large number of instances or data items to work from. The goal of ILP is to find hypotheses, expressed as logic programming clauses, from a set of positive and negative examples and in the presence of a domain theory, where the latter two are also expressed in the same logic programming formalism. So what we have given is: a set of positive and negative examples expressed in an observation language we call LE, a domain theory T, a hypothesis language LH that delimits the clauses allowed in the hypothesis space H, and a relation covers. Covers is important because it is the means for us to decide whether a certain hypothesis entails (covers) a certain example E, considering also the background theory. Given these pieces, inductive logic programming has the aim to find a hypothesis H that covers all positive examples and no negative examples, as written out below. So one can say that the learning goal of inductive logic programming expressed in this way is very much aligned with the learning goals for all other kinds of inductive learning. What is not covered in this is of course the basic capability of logic programming as such, which from the start enables deductive reasoning: every execution of a pure logic program can be regarded as an instance of deduction. I say this because if you only define inductive logic programming as inductive, you forget the deductive side, and one of the big positive things with this area is that the combination is made possible.
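
Written out, the problem setting just described can be stated as follows (a standard formulation, with T the domain theory and H the hypothesis sought):

    \forall e \in E^{+}: \; T \wedge H \models e \qquad \text{(H covers all positive examples)}

    \forall e \in E^{-}: \; T \wedge H \not\models e \qquad \text{(H covers no negative example)}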

Let us now talk about some of the main advantages of ILP. Actually the most important advantage is that by using the same framework, the language of logic programming, it is possible to express examples, hypotheses and the domain or background theory in the same language. The second important thing is that one can handle deductive inference and inductive inference within the same framework. Because of the ability of logic programming to support structured data types and multiple relations in a convenient way, ILP is also, by inheritance from logic programming, very suited to handle domains where there is a need to conveniently describe complex structures. One example of that is chemistry and the complex chemical structures that many applications demand. One thing that is maybe not the biggest advantage, because of course many other systems and approaches offer similar features, is that some ILP systems also have the possibility to invent new predicates and add them to the domain theory; this is a dynamic property that you can also introduce here.

As has been said, the nice property of inductive logic programming is that everything is represented in the same way. This means that hypotheses, examples and facts in the end are all represented as logic programming clauses. At the core of inductive logic programming are the rules, or relations, of specialization and generalization, and these two rules connect hypotheses and facts. One can say that a hypothesis G is more general than a hypothesis S if and only if G entails S; S is then also said to be more specific than G. There are then two kinds of relations that are used in different ways, so let us start with the so-called specialization rule or specialization relation. If you have a deductive inference rule R that maps a conjunction of clauses G onto a conjunction of clauses S such that G entails S, this is a specialization rule. The typical rule of that kind in inductive logic programming is called theta subsumption. If we set up a hypothesis space and structure that space using the theta subsumption specialization relation, this search space will be a lattice, and for every pair of clauses, under the specific conditions of theta subsumption, there will exist a least upper bound and a greatest lower bound with respect to the theta subsumption relation. So it is a pretty well-ordered structure. In this context what we talk about is an inductive process where we start the algorithms from the top, with the most general hypothesis, and this hypothesis can then be specialized in order to finally reach the facts. The opposite direction is generalization, where we essentially start bottom-up, and there we have a similar generalization rule. The typical ILP rule of the generalization type is called inverse resolution. There are also some variants of that, such as inverse entailment, but given how much time we can spend on this topic it is fine with one example of a top-down and one of a bottom-up operation. In the same fashion, a typical generalization rule like inverse resolution establishes a least general generalization from positive examples. It is important to realize that in this context of inductive logic programming, induction is viewed as the inverse of deduction as far as possible; that is the basis for the way of designing the algorithms. Finally, one comment: in inductive logic programming the process of induction is viewed as a search process. This means that what happens in an inductive logic programming induction process is that we search for the appropriate hypotheses in a search space structured through the generalization and specialization relations; therefore, when we design an algorithm, we have to marry the properties of the search algorithm with the specific properties of these relations.
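
As a small illustration, here is a hedged Python sketch of a brute-force theta subsumption test for function-free clauses: a clause C theta-subsumes a clause D if there is a substitution theta such that C with theta applied is a subset of D. The clause encoding (a set of body literals, with the head assumed equal) is an assumption made for this example.

    from itertools import product

    # A clause body is a set of literals; a literal is a tuple
    # (predicate, term, ...); variables start with '?'.

    def theta_subsumes(c, d):
        # Brute force: try every mapping of C's variables to terms in D.
        vars_c = sorted({t for lit in c for t in lit[1:] if t.startswith('?')})
        terms_d = sorted({t for lit in d for t in lit[1:]})
        for values in product(terms_d, repeat=len(vars_c)):
            theta = dict(zip(vars_c, values))
            image = {(lit[0],) + tuple(theta.get(t, t) for t in lit[1:]) for lit in c}
            if image <= d:
                return True
        return False

    # The more general bird clause theta-subsumes the more specific one:
    general = {('lays_eggs', '?x'), ('flies', '?x')}
    specific = {('lays_eggs', '?x'), ('flies', '?x'), ('has_talons', '?x')}
    print(theta_subsumes(general, specific))  # True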

We will now look at what is termed here the generic ILP algorithm. First it should be repeated again that inductive logic programming views the establishment of hypotheses as a search process. So we always have our standard search processes: it could be depth-first, it could be breadth-first, it could be best-first, it could be anything. That is the baseline, so all algorithms will somehow be phrased in a search process context. Secondly, as I already said, we have two main kinds of relations that we use for structuring the hypothesis space: a generalization relation and a specialization relation, where generalization is more relevant to use in a bottom-up perspective, in a bottom-up process, while specialization is more relevant to use in a top-down setting. The algorithm that we are going to look at now should potentially be able to work in both settings. In both settings we somehow initialize the set of possible relevant hypotheses. Naturally, if we have a top-down approach we would instantiate it to the most general hypotheses possible, while if we have a bottom-up approach we will instantiate the hypotheses with the actual data items of the case we handle. This algorithm describes a very simple iteration where in each stage we take a hypothesis from this list of hypotheses and apply the inference rules that we have chosen; as already said, for top-down that is typically theta subsumption and for bottom-up typically inverse resolution, those are the typical choices in inductive logic programming. But independently of what we choose, we transform the list of current hypotheses by systematically applying the chosen inference rules in one or the other direction. In each step, after we have generated a new set of hypotheses, it may be that we have created too many, redundant or irrelevant hypotheses, so in every cycle there is a pruning step where we use some criteria designed to prune some of the hypotheses in the list. This loop goes on until some stop criterion is satisfied.
On this slide we elaborate a few of the items mentioned in the algorithm. As already stated, there are a few dimensions here. The first is the direction of the process, whether it is top-down or bottom-up, and that of course influences how we initialize the hypotheses, differently if we start top-down or bottom-up. It also influences the choice of inference rules or operations. Furthermore, there is another distinction: for every algorithm we have to decide on which kind of basic search strategy we apply. We can go for depth-first, we can go for breadth-first, we can go for best-first, etc. And it turns out that, as this algorithm is written, there are actually two places in the algorithm where our actions or choices control the search strategy that we actually apply: what we do when we take out a hypothesis to expand it, and what exactly we do when we prune. This combined choice controls what search strategy we will have, so with certain actions in these places in the algorithm we will get a depth-first search strategy, otherwise we will get breadth-first and so on. A final comment on this slide: the stop criteria of course state when we should finish, and the main condition under which we naturally stop is that we have achieved at least one hypothesis in the hypothesis list that is good enough. A sketch of this loop is given below.
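
The loop just described might be rendered as the following minimal Python sketch; initialize, expand, prune and good_enough are placeholders for the choices discussed above, not functions from any particular published system.

    def generic_ilp(examples, background, initialize, expand, prune, good_enough):
        # Generic ILP search loop. It works top-down or bottom-up depending
        # on how the injected helpers are implemented.
        hypotheses = initialize(examples, background)   # most general, or data items
        while hypotheses:
            for h in hypotheses:
                if good_enough(h, examples, background):   # stop criterion
                    return h
            h = hypotheses.pop(0)   # pop(0): breadth-first; pop(): depth-first
            new = expand(h, examples, background)   # apply chosen inference rules
            hypotheses = prune(hypotheses + new, examples, background)
        return None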

There is one point in the algorithm, after we have transformed the hypothesis space in each iteration through an expansion based on the generalization and specialization operators, where we have got a new set of hypotheses. At that point we are supposed to look into whether it is possible to prune the hypothesis set. There are two main criteria we could use to decide what to prune. The first case has to do with hypotheses that do not cover all positive instances, which means they are too specific, not general enough. If such a hypothesis is not general enough, but in our set we have other hypotheses that are specializations of it, then it is not meaningful to pursue those, because if the first one is not general enough, hypotheses more specific than that one cannot be either, so they are not meaningful to keep. Similarly, if we have the situation that a hypothesis is too general, which means that it also covers some negative instances, then a generalization of this hypothesis is even worse, so it is not meaningful to keep either. There may also be other criteria, but these I think are the primary ones to consider at this point.
We also had a stop or termination condition in this algorithm, so the question is when we naturally stop the iteration. One attempt, which I make on this slide, is to define something called a correct hypothesis. Given that definition, one can say that a natural stop condition is whether you have established one correct hypothesis among your hypotheses. The definition of a correct hypothesis here is that it satisfies four requirements. First, such a hypothesis should be sufficient in the sense that it covers all the positive examples we have at hand. It should also satisfy the requirement of necessity, which means that if we take it away, not all positive examples are covered by any other hypothesis; it is therefore needed, and we will not have coverage without it. The third requirement is weak consistency, which means that the hypothesis does not contradict any element of the domain theory. The last part is strong consistency, which means that the hypothesis is not consistent with the negative examples. If a hypothesis satisfies all these requirements we call it correct, and the combined criteria can then be a candidate for use as a termination criterion.
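
Under the same caveat, the four requirements might be checked like this; covers(h, e, theory) and consistent_with(h, theory) are assumed helper predicates, and 'others' are the remaining hypotheses in the current set.

    def is_correct(h, others, positives, negatives, theory, covers, consistent_with):
        # Check the four requirements for a 'correct' hypothesis h.
        sufficient = all(covers(h, e, theory) for e in positives)
        # necessity: some positive example is covered by h and by no other hypothesis
        necessary = any(not any(covers(o, e, theory) for o in others)
                        for e in positives if covers(h, e, theory))
        weakly_consistent = consistent_with(h, theory)
        strongly_consistent = not any(covers(h, e, theory) for e in negatives)
        return sufficient and necessary and weakly_consistent and strongly_consistent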

Let us now look at a few examples. The first, pretty short, example is about the kind of algorithm that is based on generalization, which means a bottom-up process where we use inverse resolution as the main generalization operation. In this case we start from the facts that we have: Adam is the father of Kain, Adam is male, Adam is the parent of Kain, and so on. We use these facts to infer the new rule that father is implied by male and parent, and it is natural to see that as an inverse resolution, because at the top you can see the example of a resolution where the rule for father, together with the fact that Adam is male, makes us conclude that Adam is the father of Kain. If we look in the other direction now, the top-down direction, the specialization operation is typically theta subsumption. What is exemplified here, first for propositional logic, is that for a very simple case the subsumption operation generates a lattice over the hypothesis space, and you can see that every pair of elements in this lattice has a least upper bound and a greatest lower bound. This is a very neat structure, and as you can see on the next slide the same also holds for predicate logic, so for logic programming we will end up with structures that look approximately like this.

To compare the top-down and bottom-up approaches, let us introduce another example. Here we have a simple target concept, bird(X); we have some positive examples, like that the penguin is a bird, the eagle is a bird, the sparrow is a bird; and then we have some background knowledge saying that the penguin, the sparrow and the eagle lay eggs, as all birds do, however only the sparrow and the eagle fly, and only the eagle has talons. If we first look at the hypothesis space for this example from a bottom-up perspective, where inverse resolution is the relation that defines the structure, you can see that the eagle being a bird is covered by the clause at the bottom: bird(X) if X lays eggs, flies and has talons, while the sparrow being a bird must be an instance of the more general clause with only lays eggs and flies, and so on. So starting from the facts, a generalization algorithm would construct hypotheses within this space; if we in the same fashion look at the top-down search based on theta subsumption, it will just be reversed. Hopefully this simple example can give you a flavor of the basic structures that always have to underlie the inductive processes in inductive logic programming, independently of whether they are top-down or bottom-up.

I want to wrap up this lecture by first giving you some examples of ILP systems and then giving you two examples of important application areas. First ILP systems: you get four examples here, FOIL, PROGOL, Golem and Marvin, actually two top-down approaches and two bottom-up approaches, and as you can see they also represent different kinds of search strategies: FOIL and Golem use hill climbing, while PROGOL has implemented best-first search. Probably the most well-known of all these systems are FOIL and PROGOL.

When we talk about application areas, the two strongest application areas for inductive logic programming so far are actually natural language processing and bioinformatics. Maybe not surprisingly so, because in these areas there is a high demand for being able to represent complex structures; therefore logic programming provides a very good basis for describing the examples, the hypotheses and the domain theory, and there have been a few success stories in these areas over the years. This was the last part of this lecture; thanks for your attention. The next lecture, 5.4, will be on the topic of reinforcement learning. Thank you.


Welcome to the fourth lecture of the fifth week of the machine learning course. This lecture will be about reinforcement learning, and it will be divided into three parts. We start today with part 1, the introduction. The two first subtopics we have discussed so far this week, explanation-based learning and inductive logic programming, stem from a tradition of computer science and logic. What they have in common with what we are going to talk about today is that they do not learn in a vacuum, from a set of examples alone, but on the brink or border of a reasonably strong, already existing domain theory. Reinforcement learning, the topic of today, comes from another academic tradition: reinforcement learning methods as they look now were very much inspired by and have grown out of a long tradition in control theory. But as for the other systems we talked about, reinforcement learning is not learning in a vacuum; it is learning on the border of an existing system with an existing domain theory. The intuitive scenario for reinforcement learning is an agent that learns from interaction with an environment to achieve some long-term goal related to the state, the state being the state of the environment, of course with the agent within it. This iterative learning takes place as a series of actions on behalf of the agent, with systematic feedback from the environment on each action. The mapping from states to possible actions, which is what controls the actions in each state, is called the policy. The achievement of goals is defined by rewards or reward signals, those being the feedback on the actions from the environment. This terminology is very important; it is highly recommended in this area to really, thoroughly learn the terminology, because these few basic concepts will come back again and again in all kinds of forms when you look at the very many different kinds of algorithms in this area. Apart from reward, which is the atomic feedback on a single action, there are also two other important, related concepts. In this area a lot of the work is centered not just around a single action but around sequences of actions; a sequence of actions we typically call an episode, and an important concept related to rewards is then the accumulated reward over a whole episode. It is a sum, or not really a sum because the way you aggregate the rewards can differ, but at least one can say it is the aggregated reward over the whole episode, and that we call a return. An action sequence is the sequence of actions from a specific state to something we call a terminal state, and what it means to be a terminal state has of course to be defined from problem to problem. The goal of most algorithms in this area is to establish a policy for how the agent should act, such that it maximizes the returns across all possible action sequences. There are many examples today of reinforcement learning, actually many success stories. On the front page you can see some slides referring to the AlphaGo system, which is one of the success stories, where a reinforcement learning implementation actually created a world-champion-level Go playing program. There are also other examples: on this slide you can see a small robot that should manage to navigate through a maze, assuming that this robot is provided with a reinforcement learning system.

As I said, it is highly recommended to really learn the terminology for this area, because it will simplify all further work. So let us now go through it again, term by term. An environment here is a micro world defined for the particular reinforcement learning problem, including the agent; many times when this is referred to we use the letter E. Then we have an agent, often designated as A. A state is a particular configuration of the agent within the environment, and typically we use the letter S to refer to it. Terminal states are the defined end states for a particular reinforcement learning problem; for every domain the terminal states have to be clearly defined. Then we come to actions: an agent selects an action based upon the current state and the policy P, where the policy P is a mapping from states of the environment to the potential actions of an agent in those states. Policies can be deterministic, which means that the policy depends only on S, or stochastic, which means that the policy also involves probabilities over the actions a. Related to that there is another concept called the transition probability function, which is the function from S and A onto a new state S prime; this specifies the probability that the environment will transition to state S prime if the agent takes action A in state S. An episode, sometimes also called an epoch, is a sequence of states, actions and rewards which ends in a terminal state, and the reward gives feedback from the environment on the effect of a single action A in state S leading to S prime. Discounted reward is a concept that means that when we accumulate rewards over a whole episode, we want to implement the intuition that what we do early in the sequence has more weight than what we do later in the sequence; therefore we put a kind of discount factor on the later action steps in the sequence, and typically the discount factor lambda is a number between 0 and 1. The return is the accumulated reward over an episode, written out below. The value function, finally, is the estimation of the value or utility of a state S with respect to its average return, considering all possible episodes within the current policy, ending as always in terminal states. This value function must continually be re-estimated for each action taken. Finally, the model of the environment, to refer to two things already mentioned, is considered on one hand as the relation, the function T, from S and A to a new state S prime, and on the other hand as the rewards associated with all those steps.
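
Written out, the discounted return of an episode with rewards r_1, r_2, ... and length T would be (using the discount factor lambda from above):

    G = r_1 + \lambda r_2 + \lambda^2 r_3 + \dots = \sum_{k=0}^{T-1} \lambda^{k} r_{k+1}, \qquad 0 \le \lambda \le 1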

To rehearse the terminology one step more, we introduce a little example called the four times three world, which is a board of four times three positions indexed by two coordinates. The agent has a start position in (1,1); the position (2,2) is excluded, it is forbidden. There are also two rewards. Actually there can be a reward for any step taken in this kind of domain, but in some domains it is more common that you get a reward at the end, so one can say there are two categories: there are the kinds of domains where you gather a reward when you reach the terminal state, and there are the domains where you get rewards more or less for every step you take; both variants are possible. Here you can see that one position, (4,3), has a positive reward of plus one, and there is another one that has a negative reward of minus one. There are no fixed rules for the ranges of rewards; that is up to the designer and the design for a particular problem. Reaching all other positions gives a reward of zero. The actions are to move up, down, left and right; of course the board itself restricts what actions can be taken. The policy is deterministic, which I mentioned earlier; just to repeat, this means that every action in a specific state can only lead to one other state. In the stochastic case one action can lead to different states, and there then needs to be a probability for which state the same action ends up in. You can also see in this example some exemplification of what an episode is, and what return those episodes give in the end; here it is very simple because only the last step gives a reward.
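
A minimal Python encoding of this four times three world could look as follows. The +1 reward at (4,3), the forbidden position (2,2) and the start in (1,1) are from the slide; placing the -1 reward at (4,2), as in the classic textbook version of this world, is an assumption made here.

    # A minimal sketch of the 4x3 world. Placing the -1 at (4,2) is an assumption.
    FORBIDDEN = {(2, 2)}
    TERMINALS = {(4, 3): +1, (4, 2): -1}
    ACTIONS = {'up': (0, 1), 'down': (0, -1), 'left': (-1, 0), 'right': (1, 0)}

    def step(state, action):
        # Deterministic transition: each action leads to exactly one state.
        x, y = state
        dx, dy = ACTIONS[action]
        nxt = (x + dx, y + dy)
        if not (1 <= nxt[0] <= 4 and 1 <= nxt[1] <= 3) or nxt in FORBIDDEN:
            nxt = state                    # the board restricts the moves
        reward = TERMINALS.get(nxt, 0)     # all other positions give 0
        return nxt, reward, nxt in TERMINALS

    # One episode: start in (1,1), go up along the left edge, then right to (4,3).
    s, ret = (1, 1), 0
    for a in ['up', 'up', 'right', 'right', 'right']:
        s, r, done = step(s, a)
        ret += r
    print(s, ret)   # (4, 3) 1  -- only the last step gives a reward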

As I have already said, there is a strong inspiration on the area of reinforcement learning from control theory. Even if the definition or characterization of what reinforcement learning is has been developed within the machine learning and artificial intelligence field, the kinds of models and methods used bear a strong resemblance to those traditionally used in control theory. Therefore it is not surprising that the way of describing and modeling a reinforcement learning scenario is inspired by one model coming from the control theory field, the Markov decision process, here abbreviated MDP. In a Markov decision process we have the kind of setup that we more or less already described for reinforcement learning: an MDP is like a 4-tuple of states, actions, rewards and transitions, where S is the set of states, A is the set of actions possible for each state that the agent can choose from, defined by a policy P; then we have the reward function, which is the feedback on each action leading to some other state, and we have a transition probability function specifying the probability of moving from state S to state S prime given an action A. There is one thing that is important to note: coupled to this model is the assumption that when we look at the state transition probability, we can forget all the paths that led to this state. The Markov property says that the transition probabilities depend only on the state, not on which path led to that state. Also in this framework the goal for all kinds of algorithms is to find policies P that maximize the return, which is the expected future cumulated, possibly discounted, reward for episodes starting from one state and moving to a terminal state. To the right you can see an exemplification of states and transitions; the probabilities are just an example for a specific case, and it is exactly the same four times three board example that we looked at.

So now it is time to come to a very important distinction for this area. In one MDP scenario, I would say the default one, we have a complete and exact model of the Markov decision process, in the sense that the transition function from states and actions to new states and the reward function from states and specific actions are fully defined. There is thus a domain description available that can completely specify those two functions or relations. In this case, where we have complete knowledge available, the MDP problem is a planning problem that can be exactly solved by use of techniques like dynamic programming. However, in many cases these two relations T and R are not completely known. This means that this information is not available from the domain, meaning that the model of the MDP is not complete, and this case is what we truly call reinforcement learning. The complete case is more of a standard problem in control theory, which can be handled by standard techniques like dynamic programming and is essentially a planning problem, while the case with incomplete knowledge is a learning problem.

As a reference point, let us first look into how the case with complete knowledge is handled. As I said, the standard technique for handling this situation is called dynamic programming, which is an algorithm design technique for optimization developed by Richard Bellman in the 1950s. Like divide and conquer, dynamic programming simplifies a complicated problem by breaking it down into simpler subproblems in a recursive manner, combining the solutions to the subproblems to form the total solution. If a problem can be optimally solved by breaking it recursively into subproblems and then forming the solution from optimal solutions to the subproblems, it is said to have optimal substructure. There are other methods for breaking problems into pieces, like divide and conquer, but in that case the subproblems are supposed to be totally independent, while dynamic programming allows the subproblems to be in some sense dependent on each other. This way of thinking about reducing a problem into subproblems also affects the way we think when we want to estimate, for example, the value function, the utility function of a certain state in a Markov decision process. Typically we define the value function recursively in terms of the value function for the remaining steps of an episode, and typically we do the same when we find a way of calculating or estimating the policy function needed. There is also an important equation that defines what is an optimal value function in this complete-knowledge situation, and that equation is called the Bellman equation.

Before we continue I must tell you a funny story. Not so long ago somebody asked me why this method is called dynamic programming, and I had to say I don't know; I always accepted the name, I just know what it is. But because I got the question I really tried to search for the reason behind the name, and funnily enough I found this little piece from an old article, actually by professor Bellman himself. In the 1950s he worked on a project sponsored by the US military, and at that time it was not really popular to do something very theoretical if you had funding from the state, especially from the military, so you always had to defend that what you did was useful for the purpose you were funded for. As this story goes, Bellman chose a name that was supposed to look very practical, so he could not be accused of doing something theoretical or mathematical. So that explains why the relevance of this name is maybe a little difficult to understand.

So the main principle of dynamic programming is to recursively divide a problem into smaller parts, take the solutions to the subproblems and combine them into the solution of the larger problem. This way of thinking also affects the way we, in dynamic programming, plan the calculation of the various important functions. As you can see here, there are two functions, the value function and the policy function, and both these functions are defined in a recursive way. This means that the value function of a state S is actually a sum over all episodes starting in S, via the different actions leading to other states S prime, taking the probabilities for going in these various directions, and for each direction taking the reward for going in that direction plus the discounted value of the value function in the new state. So you start from the discounted value of the state you are going into, you add the reward, and you multiply by the probability to get the full picture. In a similar way you do with the policy function; the thing is that for the policy function you want advice on which action to take, so logically you look at the same kind of expression as for the value function, but you want the argmax, the action A that gives the highest value of the value function, which is the reasonable action to take. This equation also has the recursive structure. Then we come to Richard Bellman's principle of optimality, which says that an optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision. So whatever step you take initially, in order to get something totally optimal, even the rest must be optimal. Coupled to that we also have the Bellman equation, which is an equation specifying what an optimal value function is, and as you will see below, its structure is the same recursive structure.
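
Reconstructed from this verbal description, the recursive definitions read as follows, with T the transition probability function, R the reward function and lambda the discount factor; the last line is the Bellman optimality equation:

    V^{\pi}(s) = \sum_{s'} T(s, \pi(s), s') \left[ R(s, \pi(s), s') + \lambda V^{\pi}(s') \right]

    \pi(s) = \arg\max_{a} \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \lambda V(s') \right]

    V^{*}(s) = \max_{a} \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \lambda V^{*}(s') \right]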

Just because we, via the Bellman equation, have an equation that gives the criterion for an optimal value function, it does not mean that we have one; what remains is to find methods that solve the Bellman equation. We do not have much time to go into this at this point, so on this slide I roughly mention a few variant approaches: something called value iteration, where we through a simple procedure iterate the values, or policy iteration, where we do the same for the policies. It is always the case that you can infer one from the other: if you have an optimal value function it is pretty straightforward to infer an optimal policy, and vice versa. There are other techniques you can use, such as linear programming, but we will not go more into this. I will now shortly introduce a simple example to illustrate the thinking of dynamic programming. This is actually not reinforcement learning; it is just an example to illustrate the recursive style of reducing a larger problem to smaller problems and then combining the solutions. This is a graph where every edge has a value or weight, say a distance, and the task is to find the shortest path. As you can easily see in this simple example, the shortest path is nine, but for a general problem you need a method. If you have a greedy method, for example, the greedy method always starts where you are and takes the best choice locally, but never backtracks. So in this case, with a greedy method, from S you take the edge with weight one and come to A, then you take the edge with weight 4, but when you come to D you have only one way to go, and unfortunately you need to take an edge with weight 18, so you end up with a total weight of 23, which is absolutely not optimal. However, what you can see on the next slide is that the dynamic programming approach is not greedy; it is rather more like breadth-first. What dynamic programming does is to systematically, in parallel, check all paths recursively, and dynamic programming in this case, in contrast to the greedy algorithm, provides you with the optimal path.
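
For the complete-knowledge case, value iteration might be sketched as follows; the representation of states, actions(s), T and R as Python functions and dictionaries is an assumption for this illustration.

    def value_iteration(states, actions, T, R, gamma=0.9, eps=1e-6):
        # Value iteration for a fully known MDP.
        # T(s, a, s2) is the transition probability, R(s, a, s2) the reward,
        # actions(s) lists the actions available in s (empty for terminals).
        V = {s: 0.0 for s in states}

        def q(s, a):
            return sum(T(s, a, s2) * (R(s, a, s2) + gamma * V[s2]) for s2 in states)

        while True:
            delta = 0.0
            for s in states:
                if not actions(s):
                    continue                       # terminal states keep value 0
                new_v = max(q(s, a) for a in actions(s))
                delta = max(delta, abs(new_v - V[s]))
                V[s] = new_v
            if delta < eps:                        # converged: Bellman residual tiny
                break
        # an optimal policy is then read off greedily from V
        policy = {s: max(actions(s), key=lambda a: q(s, a))
                  for s in states if actions(s)}
        return V, policy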

What we will do for the rest of this lecture is to accept that for the case where you have full knowledge you can use dynamic programming and techniques related to it, while for the other case, where you do not have full knowledge, you have to rely on other methods, or maybe reduce the problem to one where you have full knowledge and then continue. We will not make an attempt to generalize the Markov decision processes in a direction that makes them more useful for the case where you lack full knowledge; however, there are such extensions, and on this slide you can see one such attempt, called a partially observable Markov decision process. The extension is very simple: you also introduce the notion of observation, so apart from getting a reward at every state, an agent can also make an observation, which is a partial picture of the environment. It turns out that if you introduce this, this kind of model is more useful also for the case with incomplete knowledge, but for time reasons we will not go into this sidetrack either. This is the end of part one, soon to be continued in part two. Thank you.


Welcome to part two of the lecture on reinforcement learning. I want to start by discussing a few important distinctions within this area: the first distinction is passive versus active learning, the second is between on-policy and off-policy, the third aspect is exploitation versus exploration strategies, and the fourth aspect is model-based versus model-free reinforcement learning methods. The two first distinctions are very much related, so let us start with the first, the distinction between passive learning and active learning. Of course, all agents in all these scenarios perform actions and get rewards and so on; it is not that passive means doing nothing, so the original scenario that we described for reinforcement learning still holds, with the cycle of actions, rewards, etc. The difference is that a passive agent executes a fixed policy, which means that you can say this is an agent with no free will; one can consider that this policy has been given to the agent and the agent is not allowed to change it. The only thing the agent can do is act, always according to the predefined fixed policy, but what the agent can do is evaluate what happens: it can look at what rewards it gets, it can look at the returns etc., and it can also learn about the utilities of being in its various states. But this is a pretty circumscribed existence. In active learning the agent can update its policy as it learns, and of course there are different modes of that; it can do so also while acting in the world, and of course an active agent must consider what actions it takes, what the outcomes may be, how they affect the rewards etc. So the active learning scenario is one where there is a much more dynamic handling of the policy and its consequences, and active learning is, I would say, more typical for the true reinforcement learning case than passive learning. Very much related is the distinction between on-policy and off-policy. One can say that both these cases relate to an active agent, because in both cases we assume that the agent is interested in changing its policy. However, in the off-policy mode you stick, at least for a period, to an invariant policy, and during that period, while still obeying that policy, the agent can learn in order to define a new policy that can be used at a later stage; sometimes you call the actual policy that you follow the behavior policy, and the one which you want to establish the estimation policy. The alternative is an on-policy situation where you actually follow a policy but incrementally and dynamically change it based on experience. So one can say that in the on-policy case the planning and learning parts are interwoven, while in the off-policy case the planning and learning parts are separated.

The third distinction is between exploration and exploitation, by which is meant that an agent, when it takes actions, can follow two different approaches. On one hand it could prioritize following paths already taken, where there already exists experience from those paths and episodes, especially prioritizing those paths that gave more return. The alternative is exploration, where you could kind of go into deep water, if I may use that analogy: instead of just doing what you used to do, choosing what was good and avoiding what was bad, there are of course a lot of avenues you never tested, so in an exploration-oriented policy there is a strong ingredient of trying new paths, of course with the ambition and hope that those new paths will be more rewarding than the traditionally tried-out ones. It is then very much an issue for the agent to decide on the trade-off between these two, because obviously it can make sense to do both at various points in time and under various circumstances. A somewhat more static way of managing this balance is to be more explorative when you have rather weak knowledge or a rather incomplete view of the environment, but when you experience that you have more substantial knowledge about the situation, to rather fall back on the tried-out paths; this is doing one thing in one period and the other in another period. A more dynamic method, for example the one called epsilon-greedy, makes this judgment at a finer granularity: one defines a little window epsilon, so that you mostly keep to the best currently known action, but with a small probability epsilon you turn to exploration. In general this distinction makes more sense, with reference to the earlier distinctions, for the active case rather than the passive case; of course somebody can give you a predefined policy in which you should explore, but then especially the trade-off discussion is irrelevant. Exploration versus exploitation may make sense both in an off-policy situation and an on-policy situation, but one can assume that this way of dynamically balancing between the two makes most sense in the online, active case.
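
The epsilon-greedy idea might be sketched as follows, assuming the action values for state-action pairs are kept in a dictionary Q:

    import random

    def epsilon_greedy(Q, s, actions, epsilon=0.1):
        # With probability epsilon explore (random action),
        # otherwise exploit (best currently known action).
        if random.random() < epsilon:
            return random.choice(actions)                        # exploration
        return max(actions, key=lambda a: Q.get((s, a), 0.0))    # exploitation

A common refinement, matching the static view above, is to start with a large epsilon and let it decay as the agent gathers more substantial knowledge.
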
Finally we have the distinction between model-free and model-based reinforcement learning. We still talk about the situation where our knowledge about the environment is not complete, which means that the complete picture in terms of the transition function and the reward function is lacking. There are then two approaches here. What is called model-based is the following: we realize we have an incomplete model of the environment, but our approach will be to first complete that model using certain schemes, and when we think we have completed the model to the point we find acceptable, we regard it as complete and fall back on, let us say, the dynamic programming approach, which we already realized works well for the complete-knowledge case. The model-free approach is that we give up the goal of completing the knowledge; we give up the quest for completing the transition relation and the reward relation, and rather work directly with estimates of the utilities. One example of that is Q-learning, which we will talk a little more about, where you set up a special measure, the Q-value, which is a particular kind of utility value for which you do not explicitly have to rely on the transition and reward functions. What we will do now is to look at various approaches to solving the reinforcement learning problem in the case of incomplete information. We will talk a little about one particular model-based approach, called adaptive dynamic programming, and then we will mostly talk about model-free approaches: we will talk about an approach for doing a direct estimate of the value function, we will look at something called Monte Carlo simulation, we will talk about temporal difference learning, and finally we will exemplify that with a method called Q-learning.
As you can see in the pictures here, you can get a flavor of the rough differences. Adaptive dynamic programming is model-based: it falls back on building a complete model and then applying dynamic programming, and I hope you remember that dynamic programming does an exhaustive walkthrough of a breadth-first character, so this fits well with the picture you see here; it actually tests all routes up to a certain point. In contrast to that, Monte Carlo simulation is a method where you take samples: you select samples of episodes, disregard the rest, and base your estimates just on those episodes, so one can say that Monte Carlo simulation takes few routes but takes them all the way to the end, to the terminal point. In contrast to both, the temporal difference methods still follow specific paths, but do so in a more shallow fashion, not always following up the whole episode, rather looking at pairs of states and the differences between pairs of states.
The simplest method is actually model-free, in the sense that it does not need the complete transition function or relation and does not need a complete reward function. It is a very straightforward method where you look at a sample of epochs or episodes and accumulate averages along the way for each episode. In this method another measure is introduced, called 'reward-to-go', and the 'reward-to-go' is just the sum of the possibly discounted rewards for each step in the episode. The method keeps a running average over all these 'rewards-to-go' for every episode it looks at, and then it sets the value function to the average of all these. 'Reward-to-go' is not exactly the same as the return as we already defined it, it is a slightly different thing, but anyway it is an accumulated reward, and it is computed for all episodes, and then you repeat and take more samples. When the number of trials goes to infinity, the sample average should theoretically converge to the true utility of the state; the only problem is that this is very time consuming and the convergence is very slow, but otherwise it is very straightforward.
The model-based approach we are going to discuss is called adaptive dynamic programming. This approach is very close to how I generally described the model-based approach as such: what you do is first complete the partially known MDP model, which means completing the full picture of the transition function and the reward function, and after that you apply dynamic programming as in the simple full-knowledge case. So you need some strategy to learn these functions, and mostly a straightforward approach is used: you collect examples of rewards for state-action-state triples, and you collect examples of the transition triples; you take the average of the collected rewards, and you calculate the fraction of times such an action leads to a specific other state. It is basically making a lot of observations of these kinds of triples and computing some statistics based on that; then you can easily define these two functions from that material, as sketched below.
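
A minimal sketch of this counting scheme, assuming the experience is available as a list of observed (s, a, r, s2) tuples:

    from collections import defaultdict

    def estimate_model(experience):
        # Estimate T and R from observed (s, a, r, s2) tuples:
        # T by the fraction of times (s, a) led to s2,
        # R by the average reward observed for (s, a, s2).
        counts = defaultdict(int)       # (s, a) -> number of times taken
        trans = defaultdict(int)        # (s, a, s2) -> number of times seen
        rew_sum = defaultdict(float)    # (s, a, s2) -> summed reward

        for s, a, r, s2 in experience:
            counts[(s, a)] += 1
            trans[(s, a, s2)] += 1
            rew_sum[(s, a, s2)] += r

        T = {k: trans[k] / counts[(k[0], k[1])] for k in trans}
        R = {k: rew_sum[k] / trans[k] for k in trans}
        return T, R   # then apply dynamic programming as in the complete case
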
Now we will turn to a model-free learning approach called Monte Carlo (MC) simulation. Monte Carlo methods are computational algorithms that are based on repeated random samples to estimate some target property of a model, and Monte Carlo simulations simply apply Monte Carlo methods in the context of simulations. Monte Carlo reinforcement learning methods learn directly from samples of complete episodes of experience. This is a limitation: the method is built on the analysis of complete episodes, starting in one state and ending in a terminal state. The Monte Carlo method is model-free; no explicit model of the transitions and rewards is built up. MC takes mean returns for states in all the sampled episodes: the value expectation of a state is set to the iterative mean of all the empirical returns of the episodes. There are two slightly different variants: in one method you only consider the first time you come to a state, because an episode may pass a state many times before it enters the terminal state, while in the other approach you make observations for every visit to a state, no matter how often you visit it.

Here you can see an algorithm for first-visit Monte Carlo; the every-visit Monte Carlo version is very similar, it just modifies step four in the middle of the algorithm, so that it considers states every time you pass them, not only at the first occurrence. For the rest, the general idea of the algorithm is that you generate episodes using the current policy P, and for each generated episode you go through the episode and, for each state, conditionally add a return item to the returns list; in this returns list you will get as many items as you have states. During all the iterations you successively calculate the iterative average of all the returns in the list, and in every round you set the value of the value function of the state to that computed average of the returns. When you have done that for an episode you generate a new episode and repeat it all over again until some convergence occurs. On the next slide you can see how we calculate the incremental mean, but I do not have any more comments on that; a sketch combining both is given below.
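
A sketch of first-visit Monte Carlo evaluation using the incremental mean; generate_episode is an assumed helper that returns one episode under the current policy as a list of (state, reward) steps:

    def first_visit_mc(generate_episode, gamma=1.0, n_episodes=1000):
        # First-visit Monte Carlo policy evaluation. V[s] is the incremental
        # mean of the returns observed after the first visit to s in each
        # episode (for the every-visit variant, drop the 'first visit' test).
        V, N = {}, {}
        for _ in range(n_episodes):
            episode = generate_episode()        # [(s0, r1), (s1, r2), ...]
            G, returns = 0.0, []
            for s, r in reversed(episode):      # compute returns backwards
                G = r + gamma * G
                returns.append((s, G))
            returns.reverse()
            seen = set()
            for s, G in returns:
                if s in seen:                   # first visit only
                    continue
                seen.add(s)
                N[s] = N.get(s, 0) + 1
                # incremental mean: V += (G - V) / N
                V[s] = V.get(s, 0.0) + (G - V.get(s, 0.0)) / N[s]
        return V
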
Here you get a simple example that shows how the Monte Carlo simulation algorithm actually works. There are just two states, and you look at two episodes with a certain structure for the rewards; the formalism is explained on this slide. You can then see in pretty detailed calculation how things are done for the first-visit Monte Carlo method: you calculate the return for the first visit to state A in the first episode, then you calculate the return for A in the second episode, and the interesting item here is the average of those, so the value function of A becomes that average. You do the same for the second state, and a similar calculation happens for the every-visit case, so this is all very straightforward.
The last thing we will do in this part is to look at another form of model-free learning called temporal difference learning, or TD. It is actually a class of model-free reinforcement learning methods which learn by bootstrapping from the current estimate of the value function. What this kind of algorithm does is to take samples from the environment, like the Monte Carlo simulations, and then perform updates based on current estimates, like the adaptive dynamic programming methods do. While Monte Carlo methods only adjust their estimates once the final outcome is known, which means that you consider the whole episode, the temporal difference method adjusts predictions before the final outcome is known: it adjusts the estimated utility value of the current state based on the immediate reward and the estimated value of the next state, so it is in principle an update based on the relation to the next neighbour. I assume the term 'temporal' is motivated by the fact that there is normally a temporal relation when you move from one state to another. At the end you can see the updating equation, where the new value of a state is the old value, plus a certain parameter times the reward that you get when you take the step, plus lambda, which is the discount we normally have in this situation, times the estimate for the next state, minus the value of the state you are in. By choosing different values of alpha and lambda you can get slightly different functionality here; a reconstruction of the update rule follows below.
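
Reconstructed, the temporal difference update rule reads (alpha is the step-size or learning-rate parameter and lambda the discount factor):

    V(s) \leftarrow V(s) + \alpha \left[ r + \lambda V(s') - V(s) \right]
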
This lecture will be continued soon in the part 3 video.
Thank you!


Welcome to the third part of the lecture on reinforcement learning; this part will only be about the special temporal difference method called Q-learning. Q-learning is a model-free, off-policy, temporal difference reinforcement learning algorithm. The goal of Q-learning is to learn a policy which tells an agent what action to take under what circumstances. For any finite Markov decision process (FMDP), Q-learning finds a policy that is optimal in the sense that it maximizes the expected value of the total reward over any and all successive steps, starting from the current state. Q-learning can identify an optimal action selection policy for any given FMDP, given infinite exploration time and a partly random policy. "Q" names the function Q(s,a) that can be said to stand for the "quality" of an action a taken in a given state s. Assuming that we have calculated an optimal Q function Q(s,a), the optimal policy in state s is the maximizing argument: the action corresponding to the maximum value in the Q function table for the state s. So it is pretty simple to infer.

Here you see the Q-learning algorithm in a pretty abstract form. As you can see, in the
beginning it initializes Q(s,a), the quality function. In practice this quality function is
represented by a table indexed by states and actions. When you start you can have arbitrary
values there; you can have zeros or whatever you like. What the algorithm does is go through
a number of episodes, with the same semantics of an episode as we have discussed earlier. It
works through each episode step by step: it takes the first state, and then for each step it
updates the quality function, which essentially means that it updates the corresponding items
in the table. The central formula here is that you take the old value, and then you add alpha
times an expression which is the sum of the reward for taking that step and the discounted
maximum Q value for the next state you come to, i.e. the max value over all actions one can
take from that state, minus the old value. The gamma is the discount parameter that we have
seen earlier, and the alpha here is normally called the learning rate. Essentially, with a low
value of alpha the effect of the change for each step is diminished: with an alpha of one you
get the full effect of the change, while with a value lower than one you can damp the effect of
the changes from step to step. These are the kind of normal hyperparameters that all learning
algorithms have. As you can see, with alpha as one, or gamma also one, the formula gets
simpler.
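
A minimal sketch of this update rule, Q(s,a) ← Q(s,a) + α[r + γ maxₐ′ Q(s′,a′) − Q(s,a)], in
code (the environment interface `reset`/`actions`/`step` and the epsilon-greedy choice are
assumptions for illustration):

```python
import random

def q_learning(env, n_episodes, alpha=1.0, gamma=0.5, epsilon=0.1):
    """Tabular Q-learning. `env` is assumed to offer reset() -> state,
    actions(state) -> list, and step(state, action) -> (next_state, reward, done)."""
    Q = {}  # Q[(state, action)] -> value, defaulting to 0
    q = lambda s, a: Q.get((s, a), 0.0)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            acts = env.actions(s)
            # Epsilon-greedy: mostly exploit the table, sometimes explore.
            a = (random.choice(acts) if random.random() < epsilon
                 else max(acts, key=lambda act: q(s, act)))
            s2, r, done = env.step(s, a)
            target = r if done else r + gamma * max(q(s2, b) for b in env.actions(s2))
            Q[(s, a)] = q(s, a) + alpha * (target - q(s, a))
            s = s2
    return Q
```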

We will now look in detail at a small example to illustrate how all this works. You can see
here that we have a 5 x 4 board: all the elements on the rim have a reward value of -8, the
yellow ones in the middle have a reward value of 0, and a single item on the rim has reward
value 8; the blue dot shows the start state. One can enumerate the states in sequence. For this
example the learning rate alpha is set to 1 and gamma to 0.5, and we have four actions: north,
south, east and west. Then you can have a look at the Q function, which as I said is a table;
here the actions are the rows and the states are the columns. This is the realization, and
obviously in this case we have just set all the initial values to 0. Then, as you understood from
the algorithm, we look at an episode, here an episode with four steps going from state 12 to
13 to 8 to 7 and then to 2, and we go through the algorithm. What we do is update the Q
values in those table entries which are concerned by this particular episode, and you can see
here, step by step, how the calculation is done, exactly with the parameter values we have set
and the actual Q values. Not much happens until the end: it is only in the last step that we get
a new Q value in one of the table entries. So it is still a very boring table; now we got the -8
in one element.
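
To make the arithmetic concrete, here is a sketch of that final step's update (assuming, as in
this example, that all Q values start at 0, α = 1 and γ = 0.5, and that the move ends in a rim
state with reward −8):

```python
alpha, gamma = 1.0, 0.5
q_old, reward, max_q_next = 0.0, -8.0, 0.0
q_new = q_old + alpha * (reward + gamma * max_q_next - q_old)
print(q_new)  # -8.0: the only nonzero update in this first episode,
              # since every earlier step had reward 0 and all-zero Q values
```
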
Then we try another episode, starting from twelve again but as an alternative choosing a path
towards the terminal with the positive reward. We can then make a similar calculation, and
we end up with a new Q value in another position. This is how it goes on: we can take more
episodes and the values will change, so gradually this table will take shape. This is just to
show you that after a while we will reach a Q table that happens to encode one of the optimal
policies, which could be depicted graphically like on this slide. Other optimal policies could
look like this, as another pattern. So this is an illustration of Q-learning. We have looked at
many algorithms of different kinds, and Q-learning is a pretty straightforward method that
illustrates how this can work. So this is the end of this lecture. Thanks for your attention; the
next lecture will be on the topic of case-based reasoning.


Welcome to the fifth lecture of the fifth week of this course on machine learning. This lecture
will be about case-based reasoning. Case-based reasoning, CBR, is the process of solving
new problems based on the experience from solutions of similar past problems. Expressed in
an alternative fashion, CBR solves a new problem by remembering previous similar problems
and by reusing knowledge of successful problem solving for those old cases. Case-based
reasoning can be motivated by assumptions such as: similar problems normally have similar
solutions, and many domains are regular in the sense that successful problem-solving
schemes are invariant over time. Case-based reasoning should be contrasted with rule-based
reasoning. An interesting analogy here is legal systems: there are some law systems in the
world, typically inspired by a German tradition, which are very much controlled by strong
rule systems, while the Anglo-Saxon legal system is typically case-based. So these are
actually two camps in problem solving; probably there is no simple answer that one is better
than the other, and they probably complement each other in some way. However, the focus
today is on the case-based approach. Cases are stored in what we call a case base, or case
memory, from which they can be retrieved and used. When a successful solution to a new
problem is found, an adapted case can be stored in the case base to increase the competence
of the system for future problem solving. This functionality is what implements learning
behavior in this kind of system. Technically, case-based reasoning is primarily supported by
the techniques described in the previous lecture on similarity-based or memory-based
learning; particular examples are the schemes for distance and similarity measures, which are
very important to apply in case-based reasoning as well.
An important aspect of a CBR system is how the cases are represented. It is necessary to
represent both the problem and its solution, and one can say that the representation problem
here is even tougher than in the pretty simple concept learning cases we looked at earlier,
because a case including its solution is a complex entity, so the representation needs
engineering. The representations of problems are the basis for case retrieval, and the
representations of case solutions are the basis for case adaptation. The different forms of
representation are the classical ones: feature-value lists, graphs, predicate logic. As for all
machine learning techniques, feature selection is a crucial sub-process here. The choice of
representation is also crucial for which kinds of similarity-based learning techniques can be
applied and utilized. One common way of modeling a CBR system's cycle of activity is as
four main phases: retrieve, where we try to find similar problems in the case base; reuse the
proposed solutions belonging to the similar problems; revise the solutions so they fit and can
solve the current problem; and then take the revised and adapted case and store it back into
the case base again, which we call retain. These phases can also be depicted in a little more
detail: we have a problem, a new case; we make a retrieval, where we search among all the
old cases; we select a few that have some close similarity; we try to reuse them on the case to
be solved, which leads to some revision and repair; and then we make a modified copy of the
case that really solves the problem at hand, which we can store again in the case base to be
used for further problem solving in the future.

Let us now talk briefly about each of these four phases, if we accept this model of case-based
reasoning. The first phase we call retrieve; the goal is to find a small number of cases from
the case base with the highest similarity to the current case. Retrieval can be based on
different kinds of techniques. One can choose to have some kind of indexing technique for
everything in the base, and of course there are different approaches to indexing. One can also
focus more on similarity-based learning techniques, instance-based learning techniques, for
example applying the nearest-neighbour method using a weighted sum of feature differences.
Finally there is also another approach, called template retrieval, where you return all cases
that fit certain parameter settings. In the end you typically return a very small set of the
closest cases for further processing.
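
As a minimal sketch of weighted nearest-neighbour retrieval (the case structure, features and
weights here are illustrative assumptions):

```python
def retrieve(case_base, query, weights, k=3):
    """Return the k stored cases most similar to the query problem,
    using a weighted sum of per-feature distances."""
    def distance(case):
        return sum(w * abs(case["problem"][f] - query[f])
                   for f, w in weights.items())
    return sorted(case_base, key=distance)[:k]

case_base = [
    {"problem": {"temp": 90, "pressure": 2.1}, "solution": "replace valve"},
    {"problem": {"temp": 40, "pressure": 1.0}, "solution": "no action"},
]
query = {"temp": 85, "pressure": 2.0}
weights = {"temp": 1.0, "pressure": 10.0}
print(retrieve(case_base, query, weights, k=1))  # the closest stored case
```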

The next phase is called reuse. Reusing a retrieved solution can be very simple if the stored
solution can be used unchanged as the solution for the new problem. This is not the typical
case though, so otherwise adaptation is required. There are two main techniques for
adaptation in CBR. The first is called transformational: it takes the previous solution and
modifies it based on some domain-specific transformation operators, so it actually transforms
the solution. The other approach is called derivational, where you reuse not the specific
solution but the algorithms, methods and rules that were used to generate the original
solution, and apply them again in a different way to produce a new solution. So in both you
need to create something different, but I would say the transformational approach works in a
more surface-oriented way, while the derivational approach works in a deeper sense.

The third phase is called revise. In this phase, feedback related to the solution constructed so
far is obtained from the environment. This feedback can be given in the form of a correctness
rating of the results, or in the form of a manually corrected revised case. The retrieved case is
adapted to reflect the differences between the new case and the retrieved case. If the solution
is unsuccessful, the retrieved case can be repaired using domain-specific knowledge. After
the solution has been successfully adapted to the target problem, the result can be stored as a
new case in memory. The final phase is called retain. The retain phase is the learning phase
of a CBR system, in the sense that this is where the revised case is added to the case base
again. Depending on circumstances, a revised case can either actually be retained or simply
forgotten; that is also a possibility. The new case is integrated into the memory structure
either by indexing or by being positioned in the feature space, depending on whether you
have an indexing philosophy or more of a similarity-space or instance-space philosophy for
how to structure the case base. It may also be that the case base structure is consolidated due
to the addition of the new case, by somehow changing the indexing mechanism or
reconsidering the importance of features, and so on.

Looking back, if we talk about learning in this kind of system, there are two phases where it
is most relevant to talk about learning. On one hand, it is relevant to talk about learning in the
revise phase, because there you actually create something new based on the experience from
the new case; and then, more technically, you integrate this new thing into the existing
memory structure in the retain phase. So learning is more related to the two later phases,
while problem solving is more related to the two first phases of the whole CBR cycle. It can
be interesting to compare CBR to other reasoning and learning approaches. One key
difference between the implicit generalization in CBR and the explicit generalization in
mainstream inductive machine learning systems concerns when the generalizations are made.
Most inductive algorithms make their explicit generalizations from a set of training examples
before the presence of any target problem; we call this eager generalization. This is in
contrast to CBR, which delays its typically implicit generalization until testing time; we call
this lazy generalization. CBR therefore tends to be a good approach for rich, complex
domains in which there is a multitude of ways to generalize a case, which is non-trivial to
handle in the mainstream inductive machine learning tradition.

Let us look at some advantages of CBR systems. Cases are sometimes the best way to
represent knowledge, especially when the available models or theories are unreliable (weak
theory domains). A good case is an efficient shortcut in the search for good solutions. The
capability of a CBR system grows organically through the incremental addition of new cases.
The intrinsic advantages of an old case can be strengthened by incremental improvements
due to the adaptations triggered by new cases. The processes of storing cases and reading
cases can be computationally less expensive than alternative learning mechanisms.

Criticism of CBR is not uncommon, especially from advocates of stricter formal methods.
These critics argue that CBR is an approach that accepts anecdotal evidence as its main
operating principle, so that without statistically relevant data backing up its implicit
generalizations, there is no guarantee that the generalizations are correct. As an example of a
response to this criticism, recent research has defined CBR within a statistical framework
that formalizes case-based inference as a specific type of probabilistic inference. This is of
course some kind of defense for the method, but the debate goes on.

CBR is very much an applied area, so not surprisingly there are many active application
sectors. On this slide a few of the more important ones are mentioned: helpdesk and customer
service systems, recommender systems not surprisingly, medical applications in particular
diagnosis, applications in law particularly in the Anglo-Saxon legal systems as I have already
commented on, technical troubleshooting, and financial management and decision making.
Finally I just want to give you a very small example of how cases of this kind could look.
Here you find an example from technical troubleshooting: you have a few stored cases and a
new case, and of course a number of features, as we have in machine learning in general. As
has been said, what you need to do is describe the problem in a domain-specific and clever
way, so that it is naturally easy to find similar objects in a constructive fashion. The next
problem is then how to represent the solution, which is normally an even trickier design task,
because the solutions should in a later stage of the process be adapted. There is unfortunately
no time in this part of the course to go any deeper, but if you are interested in CBR, please
study the recommended further reading, where you can get much more information about
both applications and concrete cases like this. So this was the end of this lecture; thanks for
your attention. The next lecture, lecture 6, will be on the topic of the tutorial on the
assignments for this week. Thank you very much.


So welcome to the last lecture of the fifth week of this machine learning course. As for every
other week, this lecture is focused on an introduction to the assignments for the week, and
also on introducing some of the further readings. As always we have two kinds of
assignments: group one questions are based on the video lectures, and group two questions
are also based on the extra recommended material. Let us go quickly through group one. You
will find slightly more questions this time, seven instead of five. There are two questions on
explanation-based learning: one is a little more of a problem-solving character, and the
second is, I would say, more of a recall character. Going on, there is one question on
inductive logic programming, which is an example related to the generalization and
specialization operations in the inductive logic programming search. Then there are three
questions on reinforcement learning, two of which are of a more problem-solving character:
question five is related to dynamic programming and question six to the model-free Monte
Carlo simulation method. Finally there is a more recall-oriented question on case-based
reasoning.

Concerning further readings, the following articles are recommended. For explanation-based
learning there is a pretty neutral and straightforward article on the subject from 1987 by
Haym Hirsh. For inductive logic programming I suggest an article that highlights applications
rather than theory, by Ivan Bratko and Stephen Muggleton, two of the key persons in the
development of this field. For reinforcement learning there is a later overview article on
reinforcement learning. For case-based reasoning you get a pretty well-known article by
Janet Kolodner, who has been instrumental in the development of that area. For Bayesian
networks you get a reference to a very substantial book on the matter; you should not feel
you have to read all of it, but if you are interested in that area, which is not very well covered
this week, feel free to delve into that book. For model-based clustering there is the very
classical work by Douglas Fisher on the COBWEB system.

As many times earlier, the questions in group two are of a more recall character. Also here
there are questions on the themes that have been prominent this week, as well as some
questions on those sub-themes that were not really well covered, such as Bayesian networks
and model-based clustering. So by this we are finished for this week. Thanks for your
attention; the next week of the course will be on the theme of artificial neural networks.
Thank you very much.


Welcome to this first lecture of the sixth week of the course on machine learning. This
lecture is divided into two parts, and the whole week will be focused on the area of artificial
neural networks and their role in machine learning. You heard a little about neural networks
earlier in the course, but I will repeat a few important things before we go into the more
detailed descriptions of this approach. Regarding the inspiration for this approach, the main
inspiration is the human, or even animal, nervous system; this means the nervous system of
the body but, particularly as a source of inspiration, the brain. The current model in
neuroscience is that nervous activity is based on very small atomic units called neurons.
These neurons work in an electrochemical way and transmit signals to each other, and the
terminology is pretty straightforward: there is a cell body, which is the center of a neuron;
there are input channels called dendrites; there is an output channel called an axon; and the
process where this unit sends a signal is called firing. There is another word in use that is
pretty important to know, and that is synapse: a synapse is the point where an axon sending
out signals connects to the dendrites of other neurons, so it is the connection point between
the communicating elements, so to say. One very important point here is that in the animal or
human system, neurons are not homogeneous; there are a lot of different kinds of neurons in
the body, tailored to specific purposes, but if you abstract, you can say that more or less they
work in the way I just described.

Furthermore, all these atomic elements, the neurons, are connected in huge networks. To
give you the order of complexity: there are some 100 billion neurons in the human brain, and
roughly 10,000 times as many connections. If you make an analogy with parameters in
another kind of system, you can say that the properties of those connections are the
parameters you adjust in order to define the functionality of the system. The real thing, as I
described on the earlier slides, is a very complex, heterogeneous system. We now turn to
computer science again and define what we call the computational model for artificial neural
networks, which is of course inspired by the real thing but very much simpler, very much
more abstract and very much more homogeneous. An artificial neural network, or ANN, is a
network of nodes or units that we call artificial neurons, and these are connected by edges, so
the corresponding graph is directed. Typically there is one layer of input nodes, one layer of
output nodes and an arbitrary number of layers in between, which we call hidden layers. This
is not the total truth, because there are some systems where the edges are bi-directional and
where there is no big distinction between input and output, but those are exceptions. An edge
typically has a weight that can be adjusted, and in this way it increases or decreases the
strength of a signal coming via that connection. The weights are very important, because
when we come to talk about learning, one can say that the functionality of the system is very
much decided by the setting of the weights, and therefore by changing the weights the system
can somehow learn. The output of each neuron is computed by some typically nonlinear
function of the sum of its inputs, given that that sum exceeds a threshold. Potentially all the
neurons can fire in parallel, so it is potentially a totally parallel system, but many times there
are temporal constraints imposed, which means that there is a sequential set of events going
on. Many times in real applications the layers are given certain functions; this means that we
do not have a network that is totally homogeneous, where anything can happen anywhere,
but rather the design of the network is such that in certain layers certain things happen and in
other layers other things happen, as a matter of the detailed engineering of solving a
particular problem. Also, typically but not always, the signal travels from the input layer to
the output layer, which is natural, but there can also be loops, which means that signals are
reprocessed. If we run a data item through the system we get some output; of course we
normally have a data set and run all the data items through. The whole process of running all
the data items of a data set through the particular network we have is normally called, in the
ANN terminology, an epoch.

The ANN model is an abstract model, but hopefully you already understood that there can be
many variants of these networks. The purpose of this slide is just to exemplify that and give
you some of the important keywords, to give you a flavor of the fact that there are many
different networks. During this week we will look at a representative subset, not all, of these
networks, and I will try to explain the purpose behind the different designs. Let us for a
moment look at a single neuron, and at what actually happens when the neuron fires. We
have a set of inputs, x1 to xn, to this particular neuron, and for each of those input
connections there is a weight. Then we have a threshold, and the role of the threshold is the
following: there is an amount of signal coming in, calculated as the weighted sum of the
inputs, but the neuron will only fire if that total amount of incoming signal exceeds this
threshold. Finally, even when the neuron fires there is normally not just a binary output; the
output is actually a function of this sum of signals, and there are many different choices for
that function, which is also an important characteristic of a particular network. Eventually we
want to learn, and we will look later at how to adjust the weights, thereby changing the
functionality of the network and thereby learning. However, it is not only the weights that are
important; the threshold also has some importance for the functionality. Normally one uses a
kind of trick to homogenize the system: one gets rid of the explicit threshold by converting it
into a kind of nominal weight. As you can see on this slide, one can take away the threshold,
introduce an extra zeroth input always carrying the signal one, and then set the weight of that
additional zeroth input to minus the threshold; in that case we call the value of that weight the
bias. This is just a technical transformation to allow us to learn the bias, or threshold, through
the same learning mechanism as is used for the weights.

On this slide you can look at an example of how to use the mechanism from the last slide.
Let us look at the neuron with a certain set of inputs, 5, 5 and 3; then we have a set of
predefined weights, and we can compute the weighted sum, which gives a result of 7. Then
we have a threshold of 3, so we subtract and the outcome is 4, which is above zero, so the
neuron fires. Before we send out the signal we also have to apply a function; the choice of
function made in this case you can see to the right, as a kind of lookup: you look up the value
4 and get the output 1 in this particular case. What you see at the bottom circle is the same
thing, but there you can also see how the threshold is converted into this extra input. There
are many kinds of activation functions, or transfer functions; this is the function that
transforms the final output from the neuron, and we will come back to this during the week.
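
As a minimal sketch of this computation, including the threshold-as-bias trick (the concrete
weights below are assumptions chosen to reproduce the sums in the example, and the step
activation is just one possible choice):

```python
def neuron(inputs, weights, bias, activation):
    """Weighted sum plus bias (bias = -threshold), then activation;
    the output is 0 unless the net input is above zero."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(total) if total > 0 else 0

step = lambda s: 1 if s > 0 else 0

# Inputs 5, 5, 3 with weights giving a weighted sum of 7;
# the threshold 3 becomes a bias of -3, so the net input is 4.
print(neuron([5, 5, 3], [1.0, 0.0, 2 / 3], bias=-3, activation=step))  # 1
```
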
One comment at this point: if you have a nonlinear problem, a very naive common-sense
observation is that you have to include nonlinear elements in your problem-solving model in
order to be able to treat that nonlinear problem. This of course also goes for neural networks:
if you want a neural network to be able to handle nonlinear problems, it is wise to include
some elements in the model that are nonlinear, and the choice of this activation function is
one of the key points where you can see to it that this happens. If you only have linear
activation functions you may have a problem, but with a nonlinear activation function you
can design your system so it can potentially also handle nonlinear problems.

As always for a particular approach, it is important to understand how you actually solve
problems with it. In the neural network case, what you have to do is model your problem and
then map that problem model onto the neurons, typically in the input layer. When you have
done that, you start the computation and get some result in the output layer; but then of
course you also have to see to it that the form of solution you desire can be extracted from
the output layer. The simplest and most usual case is that the input is a feature vector: you
describe the object or situation you want to analyze in terms of a feature vector, and in the
simplest case you simply map your feature vector onto the neurons in the input layer.
However, there are two cases, which we will come back to during the week, where the input
elements are not that simple. One is that there are complex sequences in space or time, which
demands some special treatment. It could also be that the objects have a totally different form
of representation, for example an image as input, or some speech profile; that situation also
needs special consideration. These two cases, complex sequences and non-symbolic input
items like images, are the two important special cases that have to be treated.

This slide is intended to give you a picture of how the artificial neural network area
developed initially; I call it here the childhood of the area. Essentially there is a history of 40
years here, from the 1940s up to the mid 1980s. Let us start with the end: two things
happened in the mid 80s. The term deep learning arose; we will hear a lot about deep
learning, but it was actually coined as a term in 1986 by a researcher in machine learning,
and it was not used for artificial neural networks but for another kind of symbolic machine
learning. At more or less the same time, well-known researchers like Rumelhart and
colleagues published a key article on how one can actually use artificial neural networks for
solving practical problems, and that article was kind of the starting point for the strong
growth of this area as a key technique within machine learning. Ironically, they did not call it
deep learning at the time; in 1986 it was called artificial neural networks, or connectionist
learning. So let us see what led up to this. Essentially it all started in the 40s, and I would
name the people who contributed originally. McCulloch and Pitts actually produced the first
model of this kind, and if you really look at that model, it has many of the characteristics of
what people have done ever since, so it is a very key initial contribution. More or less in
parallel, other people like Donald Hebb presented complementary theories, which have also
been influential, and we will come back to that. Then, in the next decade, there were some
experiments where people tried to build simple neural machines, such as the perceptron,
which was actually the first real implementation. In the sixties there were some very
important results in the neuroscience field on our vision systems, and you will see in a later
lecture this week that that work from the sixties has really inspired the approach that is now
more or less taken in this area to handle learning based on images. This is the initial story: as
for many areas, it went very slowly in the beginning; it took 40 years from the first ideas
until something concrete could be demonstrated, but of course once it had taken off, it went
much faster.

Let us have a look at a few of these early works. First the really important work by
McCulloch and Pitts: for the first time, they tried to demonstrate how a neuron-like unit can
perform logical operations. What they did was to look at a single neuron and devise a first
attempt at modeling it, with a very simple model: you have a binary output, no function that
transforms that output, and you have inputs, positive (excitatory) inputs and negative
(inhibitory) inputs, all with the same weights, and all signals are zero or one, so it is pretty
simple. They also had the rule that an active inhibitory input has veto power in the situation:
if any inhibitory input receives a one, the unit outputs zero. So it is not actually identical to
later approaches, but it shows the basic architecture. What was also observed in their work,
which is kind of interesting, is that already in their original article they noted that there are
certain logical operations, like XOR, that can be solved by a system of units but cannot be
solved by a single unit of this kind.
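
A minimal sketch of such a unit (the exact conventions vary between presentations; this
follows the description above, with unit weights, a threshold, and inhibitory veto):

```python
def mcculloch_pitts(excitatory, inhibitory, threshold):
    """Binary threshold unit: fires (1) if the number of active
    excitatory inputs reaches the threshold, unless any active
    inhibitory input vetoes the output."""
    if any(inhibitory):
        return 0
    return 1 if sum(excitatory) >= threshold else 0

# AND of two inputs: threshold 2.  OR of two inputs: threshold 1.
print(mcculloch_pitts([1, 1], [], threshold=2))  # 1 (AND fires)
print(mcculloch_pitts([1, 0], [], threshold=2))  # 0
print(mcculloch_pitts([1, 0], [], threshold=1))  # 1 (OR fires)
# No single unit of this kind can compute XOR, as McCulloch and Pitts
# themselves noted; that needs a small network of units.
```
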
Another important early work was that by Donald Hebb, which we will treat separately
during this week, but I will say a few words now about the basic idea. Essentially Hebb's idea
was the following: if you assume that the neurons fire, or act, in parallel, then if two
connected neurons fire at the same time, that parallel firing would normally strengthen the
connection between them, which means that the learning that should take place is to increase
the weight of that connection; and if one fires and the other does not, one should typically
lower the strength of that connection in the learning process. Hebb devised a formal model
for that kind of basic idea, so one can say it is another philosophy for updating weights than
in some of the other systems. Finally, a word about a kind of interesting system that was built
in 1956, actually the same year the term artificial intelligence was coined, by one of the
pioneers of artificial intelligence, Marvin Minsky. What he and his group tried to do was not
to devise a software system but actually build a machine. They built a machine of 40
synapses, where a synapse, as you may remember, is the connection point between the output
of one neuron and the input of another. The image on this slide is a picture of one of the
synapses; there was a physical machine consisting of 40 of these things, with complicated
electromechanical connections among them. As far as I know, this is the first attempt to
really physically implement these ideas. So this was the end of this part; we will continue in
part two.


Let us now start part 2 of this first lecture on artificial neural networks. I will do so by first
talking about some similarities and differences between learning in symbolic systems, which
we spent some time on early in the course, and learning in sub-symbolic systems, which we
will focus on this week. As you may remember, there is a variety of symbolic representations
that we mentioned, some of them more and some of them less logical: logical
representations, factual representations, object-oriented ones, and specific ones like
production rules, decision trees, Bayesian networks, semantic networks and so on. The two
sub-symbolic approaches that we touch on in this course are artificial neural networks, as for
this week, and, as mentioned earlier, genetic algorithms. If we look at the differences, one can
say that when you implement a learning mechanism in a symbolic system, you normally do it
as an add-on to a static kernel: the static kernel is programmed, and then you add on this
other mechanism by which you can adapt that static kernel to achieve adaptivity. In
sub-symbolic systems, by contrast, the learning mechanisms are typically the core of the
system as a whole; therefore problem solving and learning are more tightly integrated, or
intertwined, in a sub-symbolic system. That is a clear pro of sub-symbolic systems. There are
also cons, and one of them is clearly the following: what is learned, and how it is learned, is
normally very concrete and explicit in symbolic systems. You can actually look at the system
before and after it learns, see the difference, and more or less read off what has changed. In a
sub-symbolic system, which is made up of tiny processing elements, like the neurons of a
neural network or the genes of a genetic algorithm, it is not trivial to see what has actually
happened; and it turns out that at this moment, when sub-symbolic systems are starting to be
used a lot, this is a key problem, and there is a lot of work in the area of interpreting what has
been learned, even when the system as such is very successful. One can also point to the
following difference. In a symbolic system you literally reshape and extend the current
structures when you learn, which means that if you want extensive learning, changing the
system not marginally but more drastically, you really have to make an extensive
restructuring of the system; you have to add things, you have to add structures, and that is not
always trivial. The sub-symbolic approaches are clever in that respect, because what you do
is pre-allocate empty space, so to say: because everything is represented in a uniform form,
you can allocate a lot of space for the system but initially only use a subset, and then take the
rest into use when it is needed. So the structure is already there, but it is not used to its full
extent. An analogy people sometimes make about the brain, or the mind, is that the brain is
used to a certain extent but maybe not fully utilized, because there are potential reserves in
the brain's neuron structure for more connections that can be taken into use. There is this
distinction between parameter learning and structure learning in machine learning, and I
think one can phrase the approach in neural networks as converting structure learning into
parameter learning; because parameter learning is normally considered simpler, that is a
positive thing. And if you only want to learn parameters in the first place, the difference
between a symbolic system and a neural system is also not so great.

Learning in artificial neural networks is normally accomplished through an adaptive
procedure known as a learning rule, or learning algorithm, whereby the weights, and also the
biases, of the network are incrementally adjusted. Potentially other parameters, the so-called
hyperparameters we talked about earlier, can also be learned; the function that controls the
transfer, or activation, is an example of something that can be modified, and there are also
factors like the learning rate. These parameters occur in the models, but they are not handled
by the basic learning rules; additional rules can be added for that purpose. For several
reasons, the learning process is best viewed as an optimization process. You probably
recognize the picture here; we have talked about this earlier: it is very common in artificial
intelligence to look at things as a search process, with the purpose of optimization. What we
do here is search in a multi-dimensional parameter (weight) space, where a tentative solution
gradually optimizes a pre-specified objective function. The term objective function means
that we have a function that maps an event or situation onto a real number; intuitively it
many times represents the cost of achieving what we have, so it could be called a cost
function or loss function, and of course we then want to minimize the cost. However, one can
also design this in the opposite direction: instead of a cost function we can have a reward
function, a profit function, a fitness function or something positive, and in that case the
optimization is to maximize that kind of function. It is all up to us how we want to engineer
it.

The choice of using artificial neural networks as a representation, rather than something else,
is orthogonal to the categorization of machine learning techniques into supervised,
reinforcement or unsupervised learning. ANNs can actually be used for all three purposes. In
the supervised case, the input received from the environment is associated with a specific
desired target pattern. The weights are then synthesized gradually, updating as we treat data
item after data item, so that when we have gone through the available data set the weights
have stabilized in such a way that they minimize the actual classification or regression error
with respect to the predefined target references. Neural networks can also be used for
unsupervised learning, for example for clustering; of course the architecture of a network fit
for that purpose would be engineered differently. Therefore, if you remember the slide where
you saw all the kinds of networks that have been designed, some of them are designed
particularly for the supervised case, while others are more typically designed for the
unsupervised case. Finally, reinforcement learning can also be handled, and I would say that
roughly the ANN architectures used for supervised learning can rather easily be adapted to
the reinforcement learning case. The difference in the reinforcement case is essentially that
there is no target reference for the value you want to achieve; rather, there is a reward, or
grading, from an external environment or trainer on the quality of the result. But that can
easily be remodeled in the same kind of framework as for the supervised learning case.

Different artificial neural network models have different learning rules. What I want to
exemplify on this slide are just two common types. What you see at the bottom of this slide
is a variant of what is called the Delta learning rule, and the reason for the name Delta is that
at the center of this approach to updating the weights (because when we talk about learning
we normally talk about updating of weights), a key factor is the difference between the target
output, the reference value, and the actual value produced by our network. In the middle you
can see T, which is the target, and Y, which is the outcome. This difference, which you could
call the error of the problem-solving process, is at the core of the calculation of the next
weight value: a big difference gives a big change, and the update also depends on the sign, so
the direction of the difference affects the direction of the change. In contrast to that approach,
which is widely known and exists in many versions, there is another style of updating, which
I will call the Hebbian learning rule, which relates to the basic idea I mentioned earlier from
Donald Hebb: what matters for updating a weight is the simultaneous compatibility of the
firing of the two connected neurons. If they fire at the same time, that should be the basis for
increasing the weight of that connection; if they fire in opposite directions, that should be the
basis for diminishing, or decreasing, the value of the weight on that connection. On top you
can see the factors Y and X, which are the outputs from the two neurons in that case; as you
can understand, if those values are compatible and positive, the weight will increase, and if
they have opposite directions, there will be a negative effect on the update. So these are two
genres of updates, but the main message is: learning is updating of weights.
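
As a minimal sketch of these two genres of weight update (the learning rates and values are
illustrative assumptions; the Delta rule here is the simple single-neuron form, without an
activation derivative):

```python
def delta_update(w, x, t, y, eta=0.1):
    """Delta rule: change the weight in proportion to the error (t - y)
    and the input x on the connection."""
    return w + eta * (t - y) * x

def hebbian_update(w, x, y, eta=0.1):
    """Hebbian rule: strengthen the weight when the two connected
    units are active together (x and y of the same sign)."""
    return w + eta * x * y

print(delta_update(w=0.5, x=1.0, t=1.0, y=0.2))  # error-driven increase
print(hebbian_update(w=0.5, x=1.0, y=1.0))       # correlated firing -> increase
print(hebbian_update(w=0.5, x=1.0, y=-1.0))      # opposite signs -> decrease
```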

Looking at Delta learning a little closer: as I mentioned, the Delta learning rule is the kind of
rule that controls the update of weights in a lot of cases, in general for most of what are
called feed-forward ANNs, which includes the perceptron, single-layer neural networks, and
multi-layer networks with backpropagation, and so on. So there is a pretty wide range of
different designs where more or less the Delta learning rule is used. There are two main
principles behind the Delta learning rule. One is that it is based on something called gradient
descent, and the other concerns the way of calculating the error which is the basis for the
update; that approach is called the mean square error. I will now say something briefly about
these two concepts.

The gradient descent approach, or algorithm, is an optimization algorithm for finding a
minimum of a function; it is sometimes called steepest descent. The gradient is a
multivariable generalization of the derivative: in a one-dimensional space you have the
derivative, but in higher-dimensional spaces you have its generalization, which means that
you can take the derivative in different directions. It is a vector-valued function, as opposed
to the derivative, which is scalar-valued, and the gradient represents the slope of the tangent
of the graph of a function; more precisely, it points in the direction of the greatest rate of
increase of the function. To find a local minimum using gradient descent, one takes steps
proportional to the negative of the gradient at the current point. As I mentioned earlier, when
we model our problem we can model it as a minimization problem, but we can probably also
model the same problem in another way, so that we construct a maximization problem. If we
have constructed our model in the latter way, then we actually want to maximize the
function: we want to climb a hill rather than descend into a valley, and in that case we
naturally talk about gradient ascent rather than gradient descent. But the thinking is the same.
There is a technique we mentioned earlier called hill climbing, and actually hill climbing,
and its reverse of diving into a valley, are similar but not equivalent to the gradient approach,
because the pure gradient approach strictly prefers to go in the steepest direction: if you are
on the hill, you find the steepest slope. Hill climbing, by contrast, selects the most promising
next state. If you make an analogy with a skier, hill climbing, or rather its reverse, is probably
more natural for a normal skier: you look down and find a feasible way to go somewhere,
and you do not necessarily take the steepest path, while steepest descent always goes for the
steepest route down. Both of these approaches are greedy, in the sense that they make a local
observation: you stand where you stand, you look around, and either you go in the steepest
direction or to the most promising next place. But you always take a local decision, which
means that when you come further down you may discover that you actually went the wrong
way, because viewed from a global angle there are deeper valleys or higher hills.
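
A minimal sketch of gradient descent on a simple one-dimensional function (the function,
learning rate and starting point are illustrative assumptions):

```python
def gradient_descent(grad, x0, eta=0.1, steps=100):
    """Repeatedly step against the gradient to approach a local minimum."""
    x = x0
    for _ in range(steps):
        x = x - eta * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
print(gradient_descent(lambda x: 2 * (x - 3), x0=0.0))  # approaches 3.0
```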

The mean square error is a statistical measure that is pretty common. The idea of the mean
square error is to look at a variable and a target value for that variable, and then compare the
target with the outcome of some measurements. If you make a series of measurements of the
same variable, you take the difference between the target value and each measurement, you
square that, and you get a series of squared errors; you take the sum of those and divide by
the number of measurements. So it is pretty straightforward. This we can apply in the context
of an artificial neural network, where we look at the error, or difference, between the target
value for the analysis process and the actual output from the network. If we have a single
data item, the error measure is simply the square of the difference between the two scalars,
the target and the outcome, divided by two. And if you take a series of values for an epoch,
i.e. you do this for a number of data items, you can then, as above, take the differences
between targets and outcomes for the different data items, square them, sum them, divide by
N, and divide by 2. Then you can ask the question: why do we divide by 2? Actually that is
an arbitrary choice; you can define the measure very much as you like, being inspired in this
case by the mean square error method. The reason for dividing by 2 is that when you take the
derivative, which we typically do here, for example in the Delta method, the derivative of a
square is two times the variable, so by dividing by two we get rid of the factor two. This
simplifies our calculation, so it is put there purely for convenience; it is arbitrary, but
convenient.
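
In code, a sketch of this halved mean square error and the cancellation it buys (the targets
and outputs are illustrative assumptions):

```python
def half_mse(targets, outputs):
    """Mean of squared errors, divided by the conventional extra 2."""
    n = len(targets)
    return sum((t - y) ** 2 for t, y in zip(targets, outputs)) / (2 * n)

# d/dy of (t - y)^2 / 2 is -(t - y): the 2 from differentiating the
# square cancels the 1/2, leaving just the signed error used in the
# Delta rule.
print(half_mse([1.0, 0.0, 1.0], [0.8, 0.1, 0.6]))  # 0.035
```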

As I hope you have understood, I try to keep this course as little mathematical as I can.
However, it is quite obvious that for the field of machine learning, and artificial neural
networks in particular, it is very difficult to keep entirely out of mathematics, so these things
creep up. I have collected on this slide some areas of mathematics where it is very wise to
have some prior knowledge if further studies on this subject are of interest to you. Matrices
are an important area, and vectors coupled to that; linearity issues; inner and outer products
and their similarities and differences; metrics; what we call the chain rule of derivation; and
also some geometrical concepts are important, because many times you want to look at your
search space with some geometrical or spatial view. Graphs, of course, are also important;
knowing the basic ideas and terminology simplifies life. I am not going to go into this; I only
point at it, and I want you to understand that if you run into a problem in some of these areas,
it is wise to go back to some lectures or literature on that particular mathematical area and
make a little study there; then you can continue with machine learning. It does not have to
hinder you from going forward in this area.

Artificial neural networks are a complex area, so for the purpose of this week I have tried to
make a little map. We have now started with the fundamentals, so we are at the top, as you
can see. Below, you can see what we have before us this week. In the middle we will talk
about what I would call the mainstream of work in ANN, which is the feed-forward
networks, where you strictly go from an input layer through hidden layers to an output layer.
To the right you can see a slightly different tradition, very much inspired by what was
already mentioned about Hebbian learning; I would say this corner of neural networks is
called associative memory. There is no exact formula here, but roughly one can say that the
right part, the associative memory part, is more related to handling unsupervised learning,
while the feed-forward cases are more related to the genre of supervised learning.
Reinforcement learning is more in the middle, but I would say it is more natural to adapt
supervised learning methods to handle reinforcement learning than the unsupervised ones.
However, artificial neural networks can be used in all cases, depending on how you engineer
them. The next part of this week is the part called recurrent neural networks, which is
essentially about how to handle sequences and temporal data. The part after that concerns
what are called convolutional neural networks, which have to do with special mechanisms
for handling perception issues, particularly imaging. In the end we will try to tie things
together in a final lecture. This is the structure we will follow. Finally, just a few words about
the term deep learning: as I already said, it was coined in 1986, but it was only in 2000 that
anybody used "deep learning" about artificial neural networks, and it was not widely used
until 2012, when there were some real success stories from applications of these kinds of
systems; then the term deep learning started to be used for them. Now it is a dominating
term, and one can discuss whether it is just a term for state-of-the-art artificial neural
networks or something more specific; this we will discuss briefly in the final lecture. So that
was the end of this lecture; thank you very much. The next lecture, 6.2, will be on the topic
of perceptrons.


Welcome to the third lecture of the sixth week of the course machine learning. Today we
will talk about the model of a single neuron in an ANN. We are now in the middle block of
lectures, which all handle the kind of neural network that we call feed-forward networks. We
have studied the perceptron, which was a precursor; today we will study a single neuron as
part of a multi-layer network, and then finally we will look at the full multi-layer problem
and the learning algorithms needed for that.

An artificial neural network typically has many neurons and several layers, in contrast to the
single-neuron perceptron that we discussed in the earlier lecture. In this lecture we will study
a single neuron, but of the kind that builds up complex networks today. We will look at how
the neuron performs in a forward-feeding manner, and how its input weights are updated in
each cycle. We will use the same kind of examples as we used for the perceptron; even if we
study the neuron in isolation, it will perform in the same manner as part of a larger network.
An ANN consists of units and connections between units. If the output of A is the input to B,
then A is considered the predecessor of B, and B the successor of A, in such a network. The
behavior of the input and output units is special, in the sense that they simply pass on
external input, so to say, without any processing; they are introduced to create a
homogeneous model. A slight disclaimer about the learning function in this case: the learning
rule we will talk about here, which is essentially the Delta rule, is relevant for the
single-neuron case. In the multi-layer case, even the single neuron will be handled
differently, because the learning rule in that case will be backpropagation.

At this point in time you have seen a model of a neuron many times, but we will repeat it
here. You have the body of the neuron, you have inputs, you have weights, and you have a
summation function where you sum up the products of each input multiplied by its weight.
Then you have an output, and that output depends on whether the sum is greater than zero; in
that case the output is not just the sum, but a function applied to that sum. That function we
call the activation function, and we will come back to it in a minute. This results in an
output, but normally, for supervised learning, we also have a target value for the output to
compare with. The difference between the target value and the output value is the basis for
the error estimate that we call E. And as you see, the threshold of the neuron is handled, as
already described, through an extra input to the neuron. So the core computation of the ANN,
which is also something you should be familiar with by now, is that you assign weights to all
the inputs, you decide on a threshold, you remodel the threshold in the way we described in
terms of an additional input, you then make the weighted sum of the inputs, and the output is
decided as the activation function of that sum if the sum is greater than 0; otherwise the
output is 0.

What you heard in the last few slides was more or less repetition, for good or for bad. Now we come to some new parts. The big difference between the neuron we talk about now and the perceptron is the introduction of the transfer function. At the top left you can see a step function: if you choose the step function as activation function, you get the same functionality as the perceptron. But there are other choices, and I am now going to comment on some of the aspects you should think about when choosing the activation function. Sometimes the activation function is called a squashing function, for the reason that certain such functions squash or saturate values as they approach their asymptotic ends; obviously not all do, but some do, so therefore this name has come up. If you look at the various aspects here, one aspect is non-linearity. One can actually show that if you choose a nonlinear activation function, then a two-layer neural network can be proven to be a universal function approximator. The identity activation function, which just leaves the input value untouched, does not satisfy that property, so by including this kind of non-linearity in our model we are able to handle nonlinear problem situations as well. Also, by the choice of function we can introduce a finite range, and such situations tend to be more stable and efficient. Another thing that is important is that the function is continuously differentiable, because, as you will see a little later, the weight updating mechanism we have is based on gradient-based techniques which demand differentiability of the components of the model. In some cases one can of course handle situations where the function is only piecewise differentiable, so there can be cases with some singular points that are handled in a special manner. Another desirable property of the function is monotonicity: if the function is monotonic, you can guarantee some nice properties. In the same fashion, the properties of the derivative are important for ensuring stability and efficiency. A special circumstance is near the origin: if the function we choose approximates the identity function near the origin, we can better handle small numbers. As we will see later, one of the big problems we have with all weight updating methods based on the gradient method is the problem of the vanishing gradient, which means that if the gradient becomes very small then we come to a
situation where the weights are not updated at all: the weight updating gets stuck as the gradient approaches zero.
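As an illustration of these aspects, here is a small selection of common activation functions and their derivatives (our own choice of examples, not the full list on the slide):

import numpy as np

def sigmoid(s):
    # Squashing function: finite range (0, 1), saturates at both ends.
    return 1.0 / (1.0 + np.exp(-s))

def sigmoid_prime(s):
    y = sigmoid(s)
    return y * (1.0 - y)          # vanishes when the unit saturates

def tanh_prime(s):
    return 1.0 - np.tanh(s) ** 2  # tanh is monotonic and near-identity at the origin

def relu(s):
    return np.maximum(0.0, s)     # nonlinear, non-saturating for s > 0, one singular point at 0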

The vanishing gradient problem is thus a difficulty found especially in gradient-based learning methods, which include backpropagation. Each weight receives an update proportional to the partial derivative of the error function with respect to that weight. The problem is that this gradient can become vanishingly small, effectively preventing the weight from changing at all. The problem occurs because, even if the individual gradients are not too small, in this technique the gradients are combined using the chain rule, so as a total result you may end up with a very small value. This problem has long been known, but it was highlighted even more when people started to work with recurrent networks, because recurrent networks tend to be much deeper, and obviously the problem worsens because the chain rule will be applied many, many times. So this is one of the problems that exists for all these kinds of weight updating methods.

Let us now turn to the weight updating rule, in this case called the Delta learning rule. This rule, as has been said, is based on the gradient descent method and an error measure based on squared errors. In this case the error measure is assumed to be the square of the difference between the target value and the output value, divided by two. The Delta rule is actually derived; it is not derived here, you will only see the outcome, but the starting point for the derivation is that you look at the derivative of the error with respect to the specific weight value you want to adjust. You can see the convenience of the number two in the denominator of the error measure: when we differentiate the square, the twos cancel each other and we get a simpler formula. So the Delta rule becomes as follows. The new updated weight value is the old one plus a learning rate parameter (all these methods have a learning rate parameter that may damp the effect of the change; it is up to you to decide whether it is one, which means no damping, or a value between zero and one), multiplied with the difference between the target value and the output value, times the derivative of the output function with respect to its argument, where the argument is the weighted sum of the inputs, and then finally also multiplied by the input value for the specific connection. The intuition here is of course that, apart from making the update dependent on the difference between reference and output, and the fact that it can be damped by the learning rate parameter, there are further factors to be explained,
and the reason why the input value is also introduced here is that you want to normalize the update: the input value itself affects the updating, so by including the input value you normalize the effect among the different inputs. The derivative of the transfer function is there because the whole formula is derived by differentiating the error with respect to the weight, where the transfer function is a component. In the linear case this derivative is one, since the derivative of a linear function is one, so this factor disappears and you get the simplified formula at the bottom.
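A minimal sketch of the Delta rule as just described (names are our own; the linear case passes f(s) = s so that the derivative factor becomes one):

import numpy as np

def delta_rule_update(w, x, target, f, f_prime, eta=1.0):
    # w_new_i = w_i + eta * (target - y) * f'(s) * x_i, with s = w . x and y = f(s)
    s = np.dot(w, x)
    y = f(s)
    return w + eta * (target - y) * f_prime(s) * x

# Linear transfer function and learning rate 1: the simplified formula.
# (Arbitrary illustrative numbers, not the values on the slide.)
w = delta_rule_update(np.ones(3), np.array([1.0, 0.0, 1.0]),
                      target=0.0, f=lambda s: s, f_prime=lambda s: 1.0)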

Here is a very simple example; it is exactly the example that we used for the perceptron, but you can see the difference: we have chosen the activation function at the top right, and obviously the introduction of that function will affect the outcome, so you see that Y here becomes 0.16, because as you see the function is linear in that region. At the bottom you can see the updates of the weights; as you remember, in the linear case we can use the simplified formula and do not have to care about the derivative. It is also assumed here that the learning rate is 1, to simplify the expressions at this moment. You also get another example, which is likewise the same as for the perceptron, with three inputs, where the result is shown in tabular form. So this is the end of the lecture: we have looked at the mechanism of a single neuron and how weight updating can take place when considering only a single neuron. In the next lecture we look at the full multi-layer case, where the learning rule is backpropagation. Thank you for your attention, goodbye.


So welcome to this fifth lecture of the sixth week of the course in machine learning. The theme of this lecture is recurrent neural networks. Looking at our map for this week you can see where we are: in the last three lectures we looked into feed-forward networks with backpropagation, and recurrent neural networks is an area which is clearly within that tradition. The key point here is a problem with the networks we have looked at up to now: they are not necessarily very good at handling complex data sequences and temporal structures. Recurrent neural networks have the ambition to remedy those weaknesses. A recurrent neural network is a class of artificial neural network which is able to handle sequences or complex data in space and time. This allows RNNs to exhibit temporal dynamic behavior. Unlike feed-forward neural networks, RNNs have a memory, some kind of persistent state that affects forthcoming computations. It should be observed that many RNN systems implement memory indirectly by unfolding the network into time steps using hidden units; one can say this is a stateless fashion, and we will come back to that in more detail. Only some RNNs have explicit cell states, which we call the stateful fashion. So RNNs can use their internal state, independently of how it is implemented, to process temporal sequential structures and varying lengths of inputs and outputs. This makes them applicable to tasks such as handwriting recognition or speech recognition. One could say that if you look at the two big application areas of machine learning today, one has to do with language and speech and one has to do with images. RNNs really contribute to making networks efficient in the language and speech domain, while, as we will see later when we look at convolutional networks, those target the goal of making networks efficient in the image analysis domain. An RNN performs the same task for every element of a sequence; that is the reason for the name, recurrent, and it is very important, because if one had to treat the elements of a sequence differently, that would create very complex machinery. The elegance here is that you can handle the elements of a sequence but treat them in a similar manner. Finally, RNNs have cycles, in contrast to feed-forward ANNs. An RNN takes the output of the network from the previous time step as input and uses the internal state from the previous time step as a starting point for the current time step. Obviously there are many situations where we have more complex input than we are used to handling in the ANN case. At the far left of this slide you can see the clear one-to-one mapping: we have well-defined, discrete objects that map to some output values or some output class. Then to the right you
can see different, more complex situations. One situation could be that you have a complex input object, like an image, where in order to handle that kind of object you have to divide it up: you have a single input, but you need some ability to look into that object and divide it up in some fashion. So the second case is that you have a complex object and you have to map it onto a certain number of values or classes. Another case is that you explicitly have a single sequence as input, which could be the words of a text, and you want to map this sequence of words onto something like an opinion, or whatever it could be. A third example is that you have a sequence as input, like a sentence in a language, and, as in machine translation, this sequence of words should be translated into another sequence. So that is another situation. Finally we have the more complex case where you have many inputs and many outputs, which could happen for example in video analysis and applications like that.

RNNs have received substantial interest in the neural network world and can also boast many success stories, in particular in the context of applications in language and speech. As a consequence of this popularity the area has many alternative lines of development that are not trivial to follow, so it is not always easy to understand the limits of this subarea. Therefore we will start by discussing what we call a vanilla RNN: a single-layered RNN that can be unrolled and replaced with a strictly feed-forward, acyclic neural network. A requirement for the vanilla RNN is that time can be discretized, which in turn requires that the duration of the effect of single neuron activities is finite; such a network is also called a finite impulse recurrent network. A network that lacks this property is called an infinite impulse recurrent network, but we will focus on the finite type. After looking into the vanilla RNN we will look at two natural extensions of it. One extension is to allow dependencies not only backward in time but also forward: the vanilla RNN enables us, at a later stage, to look at earlier states, at earlier elements of the sequence we study, but we could also introduce mechanisms to look forward in time, and that is what we do in the bi-directional case. We can also stack RNNs, having multi-layer systems, and one reason for that is that if we look at an application like language, we have letters, and a word is a sequence of letters, but you can have a sentence as a sequence of words, and then a paragraph that is a sequence of sentences. So when you get a long sequence of language input you do not only want to look at the sequence on one level; you may want to look at all these sequences on various levels of abstraction. Therefore one can think of having one RNN layer on each of these levels, and obviously the layers can
be individually unrolled as for the single-layered case. Importantly, these two extensions do not destroy the nice property of the vanilla RNN that it can be unfolded into a strictly feed-forward network. Furthermore, the vanilla RNN does not solve, but rather makes worse, problems like the vanishing gradient or the exploding gradient. The vanilla RNN also has a problem handling very long sequences, many times because long sequences create very deep networks, and deep networks bring problems of their own. So we will look into one attempt to handle these problems, Long Short-Term Memory, LSTM, which introduces more complex machinery in its neuron, essentially with the purpose of getting better control of the signal processing in and among the units. Finally, it is worth mentioning that there are some architectures that are classified as RNNs but are not able to dynamically handle input sequences; however, these systems have cycles and they have internal memory, and that is the reason for including them. An example of such a system is an associative memory architecture like a Hopfield network, which we will discuss separately later.

Let's now look into what it means to unfold a vanilla RNN. We consider the case where we have multiple time steps of input, multiple time steps of internal state and multiple time steps of output, which is what happens when we handle a sequence. We can unfold the original RNN, which has a cycle, into a graph without any cycles. This means that we in a way duplicate, create copies of, the unit, where the cycle is removed: the output from the first instance goes into the second, the output of the second goes into the third, and so on, and the output at each stage can of course also be fed upwards from any of the steps to other network structures. So the output Y and the internal state U from the previous step are passed on to the network as input for processing in the next time step. Biases are of course still there; the bias is an ingredient inherited from the general concept of a neural network, but we disregard it for the moment because it would just clutter the picture. A key thing here is that the network does not change between the unfolded time steps: the same weights are used for each time step; it is only the outputs and internal states that differ. So what we do is replicate the exact structure without modifications. One can also remodel this graphically: the most natural thing is to view the copies on the same level, one copy for each element of the sequence or instant in time, but one can also reshape that, so graphically one can say that the unfolding adds to the depth of the network, with the copies layered in a vertical way; these are just different ways of depicting it. The important thing here is that this transformation, when done in a
careful way, ensures that the result of the unfolding becomes just a normal feed-forward network. Of course, by unfolding n times, if we have a long sequence, we create a very deep feed-forward network, with one layer for each element of the input sequence.
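A minimal sketch of a vanilla RNN step and its unrolling over a sequence (the tanh nonlinearity and all names are our own illustrative choices):

import numpy as np

def rnn_step(x, u_prev, W_x, W_u, b):
    # The same weights are used at every time step; only input and state differ.
    return np.tanh(W_x @ x + W_u @ u_prev + b)

def unroll(xs, u0, W_x, W_u, b):
    # Unfolding: apply the identical step once per sequence element and feed
    # the internal state forward, like one layer of a deep feed-forward net.
    states, u = [], u0
    for x in xs:
        u = rnn_step(x, u, W_x, W_u, b)
        states.append(u)
    return states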

On this slide you can see another example of what the unfolding can look like: you give input to the RNN at every instance, you get output from every instance, but you also feed the internal state from instance to instance. The consequence of unfolding for the learning process is as follows. If you carefully do what we described on the last slide, you actually get a straightforward feed-forward network, so it is potentially possible to still update the weights, that is, to learn through backpropagation. In the backpropagation, the error at a given time step depends on the activation of the network at the prior time step, so the error can be propagated back to the first input time step of the sequence; the error gradient can be calculated and the weights of the network can be updated. So with only marginal modifications we can talk about backpropagation through time, and it does not matter whether we backpropagate through the normal layers of the network or through the unrolled levels of the RNN. Of course there is a subtlety here: for a recurrent network the loss function depends on the activation of the hidden layer not only through its influence on the output layer but also through its influence on the hidden layer at the next step. Apart from such small complications, more or less the same procedures can be applied.

The most natural extension of the vanilla RNN is to introduce many levels of RNNs, or stacked RNNs. Obviously each layer in the stack can be individually unfolded as for the single-layer case. The main reason for having many layers is that the layers correspond to different levels of abstraction or aggregation for the sequential and temporal data items. As an example, in word processing the first layer can model sequences of characters, the second level sequences of words, the third level sequences of sentences, etc. The risk with an increasing number of layers is of course to make problems like the vanishing gradient even worse, but every layer is still considered possible to unfold, so the resulting structure is still a conventional feed-forward network, with some of its problems amplified of course. The next extension is what is called bi-directional recurrent neural networks. With these, the output layer can get information from past and future states simultaneously, meaning both backward and forward links are enabled. The principle in realizing that is to split the neurons of a regular RNN into two directions, one for the positive time direction for states and another for the negative time direction. One can see this as
a second step of unfolding: you have the normal unfolding to handle the sequences, but in order to be able to look both forward and backward, for every already unfolded sequence you duplicate it, with one copy for the forward direction and the other for the backward direction. These two hidden layers can both be connected to the same output, and they are independent of each other: one direction does not feed anything into the other, so they can be treated separately and be used to give information to later levels one by one. Of course a natural extension is to combine the stacking of RNNs with the bidirectional model; nothing prevents even bidirectional recurrent networks from being stacked. So, to sum up, we are still working with a scenario here where everything we talked about can be unfolded, and the net result of everything we have done would still be a feed-forward network, which is a certain beauty of it.

Let's now go back a little and look at some of the challenges for neural networks, the feed-forward ones and in particular also RNNs. There are three kinds of problems. First, the vanishing gradient problem, which we talked about already: the gradients become too small, and the reason for becoming so small is that the individual derivatives or gradients are combined using the chain rule, and the chain rule is applied once in every step, so if there are many steps and you start with gradients in the range of zero to one, the products very rapidly approach zero, and the effect of that is that the updates of the weights get stuck. The opposite problem is that you can also get very large error gradients, and such large updates have the effect that the network can become very unstable. There are ways of handling that too, and in many cases it is more difficult to defend oneself against the vanishing gradient problem. A further issue is very long sequences and temporal dependencies; of course they are not a direct problem, but they indirectly create a problem, in particular for the vanishing gradient, because they create very deep networks, and these very deep networks are then hampered by the previously mentioned problems. Long short-term memory networks, LSTM, are extensions of recurrent neural networks which basically extend their memory function. The core of the approach is to elaborate the interior of a vanilla RNN unit with the purpose of increasing control of the signal flows. LSTM was explicitly designed to combat the vanishing gradient and long-term dependency problems. LSTM was introduced by Hochreiter and Schmidhuber in 1997, and the units of LSTM are used as building blocks for the layers of an RNN, which as a whole is then often called an LSTM network. So LSTMs enable RNNs to remember their inputs over a longer
period of time. This is because the LSTM keeps information in a memory that is much like the memory of a general computer: the LSTM can read and write information from and to its memory. LSTMs are widely recognized academically as well as commercially, and are extensively used for speech and language processing by companies like Google, Apple, Microsoft and Amazon.

Hopefully this slide explains the difference between a vanilla RNN unit and an LSTM unit. You can see that, apart from the normal signal processing functionality of the RNN neuron, the LSTM neuron has a much more complicated interior, and essentially the difference is that the LSTM unit, apart from anything else, has gating functionality that controls the signal flow.

The LSTM memory can be seen as a gated cell, where gated means that the cell decides whether or not to store or delete information, typically by opening the gates or not, based on the importance assigned to the information at hand. The assigning of importance happens through weights, which are also learned by the algorithm. It simply means that the unit learns over time which information is important and which is not. Specific to LSTM is the cell state, manifested by the horizontal line running through the top of the diagram of the unit and denoted by the different "C"s, "C" for cell state. Also, in an LSTM unit you have three gates: these gates determine whether or not to let new input in (the input gate), delete information because it is unimportant (the forget gate), or let it impact the output at the current step (the output gate). That is the basic functionality of this kind of unit.
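A minimal sketch of one LSTM step with the three gates (this follows the standard formulation; the names are our own and variants such as peephole connections are ignored):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    # W, U, b are dicts holding one parameter set per gate plus the candidate.
    z = {g: W[g] @ x + U[g] @ h_prev + b[g] for g in ("i", "f", "o", "g")}
    i = sigmoid(z["i"])      # input gate: let new input in?
    f = sigmoid(z["f"])      # forget gate: delete unimportant information?
    o = sigmoid(z["o"])      # output gate: let the cell impact the output?
    g = np.tanh(z["g"])      # candidate values for the cell state
    c = f * c_prev + i * g   # cell state: the line running along the top
    h = o * np.tanh(c)       # output of the unit at the current step
    return h, c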

Some comments on structural aspects of LSTM. In principle LSTMs can, as for the vanilla RNN, be unfolded, be stacked in a multi-layer structure, or be arranged in a bi-directional fashion; all of these are possible. If this is done in a stateless fashion, where no cell state is used and the state information just flows across, then the resulting network is a normal feed-forward network, potentially with backpropagation. However, if the LSTM cells use explicit cell states, the stateful fashion, it becomes less clear, and there is no guarantee that exactly this feed-forward network can be used as in the default case. But nothing prevents combining these elements in more complex structures, as mentioned here. So what we have done in this lecture is to look into some of the more important aspects of RNNs. We looked at what we call the vanilla type, the standard RNN that can be unfolded. We also talked about how that can be extended with multiple layers and with bi-directional links, and we then also showed one well-known approach to how the
signal processing can be improved: the LSTM. There are many different variants in this area, and below on this slide you can see a few of them; many networks are classified as RNNs, and more or less all these different approaches relate to the simple ones. For example, a gated recurrent unit, GRU, is a simpler version of the LSTM. Elman networks are a very early kind of standard RNN with a rather simple structure. Hopfield networks will be described in another context, that of associative memory, and so on. But it is outside the scope of this lecture to go into more detail on this whole genre of approaches that has developed. So this was the end of this lecture; thanks for your attention. The next lecture, 6.6, will be on the topic of Hebbian learning and associative memory. Thank you.


Welcome to the sixth lecture of the sixth week of the course in machine learning. The theme of this lecture is Hebbian learning and associative memory. As you can see from the map of the lectures this week, we have now left the mainstream of work on artificial neural networks, the feed-forward networks and recurrent neural networks, and turn to another category of systems which can best be characterized by the term associative memory; but we will first turn to something more specific which we call Hebbian learning. Hebbian learning theory is a neuroscientific theory claiming that an increase in synaptic efficacy arises from a presynaptic cell's repeated and persistent stimulation of a postsynaptic cell. This theory was introduced by Donald Hebb in his 1949 book The Organization of Behavior. The theory is also called Hebb's rule, Hebb's postulate or cell assembly theory. Hebb expressed himself as follows: let us assume that the persistence or repetition of a reverberatory activity tends to induce lasting cellular changes that add to its stability. When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that the efficiency of A, as one of the cells firing B, is increased. Actually, as it turns out, neuron A has to fire slightly before B, not fully in parallel, to manifest the causality relation; this elaboration of Hebb's work is called spike timing-dependent plasticity. What is Hebbian learning in an artificial neural network? The theory is often summarized as cells that "fire together wire together". It is an attempt to explain synaptic plasticity, the adaptation of brain neurons during the learning process. In an ANN setting the plasticity is implemented through adaptation of weights. Hebb's law can be represented in the form of two rules: if two neurons on either side of a connection (synapse) are activated synchronously, then the weight of that connection is increased; if two neurons on either side of a connection are activated asynchronously, then the weight of that connection is decreased. So Hebb's law provides the basis for unsupervised learning; learning here is a local phenomenon occurring without any feedback from the environment. Let's turn now to what we call the Hebbian learning algorithm. First there is always an initialization: the synaptic weights and the threshold are set to small random values in the interval 0 to 1. Then we have the step of activation, where we compute the postsynaptic neuron output from the presynaptic input elements, denoted here as Xij, from the data item Xj, where j denotes the number of the training instance (data item) considered, while i is one of the elements of this input vector. So actually
the output Yj from the postsynaptic neuron is 1 if the sum of the inputs times the weights on the input connections, minus T, where T is the threshold, is larger than or equal to zero; if not, Yj is zero. After the output value of the postsynaptic neuron is calculated, one can go to step three, which is the learning phase, and the learning follows something called the activity product rule, which captures the essence of the Hebbian theory. Here we update the weights in the network, and the weight correction is determined by this so-called activity product rule, which says that the new weight to be applied in iteration j+1 is equal to the old weight plus a term which is alpha, a learning rate parameter, as already commented a kind of damping factor that can be between 0 and 1, multiplied by the newly calculated value of the postsynaptic neuron times the input value from the presynaptic input element. After that we take a new data item and the same procedure is repeated.
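A minimal sketch of the algorithm, reproducing the numbers of the example that follows (function and variable names are our own):

import numpy as np

def hebbian_epoch(X, w, threshold, alpha=1.0):
    # X holds one training instance per row; w is the weight vector.
    for x in X:
        y = 1.0 if np.dot(w, x) - threshold >= 0 else 0.0  # activation step
        w = w + alpha * y * x  # activity product rule: delta w_i = alpha * y * x_i
    return w

# Threshold 2, learning rate 1, weights initialized to 1, two instances 1 0 1 0:
w = hebbian_epoch(np.array([[1, 0, 1, 0], [1, 0, 1, 0]], dtype=float),
                  np.ones(4), threshold=2.0)
# After the first instance w is [2, 1, 2, 1]; after the second, [3, 1, 3, 1]:
# the synchronously active connections are strengthened.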

Let us now look at an example of this algorithm. We have a neuron with four presynaptic inputs x1 to x4. We have weights on the connections for those inputs, we have an output Y from the postsynaptic neuron, we assume a threshold of 2, we assume a learning rate of 1, and we initiate all the weights uniformly to one. We will now look at two training instances, 1 0 1 0 and 1 0 1 0, and look at how the calculations are made. As you can see here, we have instantiated the inputs to the elements of the first training instance; we have the predefined weights of this first iteration, which we call zero, and then we have a learning rate of one and a threshold of two. The first thing we do is use the first formula for Y and create the sum: we sum the input values multiplied by the appropriate weight values and then we subtract the threshold, which gives zero, but as this fulfills the first criterion, we output one. That is the first step, so now we have an output value. Then we are in the position that we can calculate the weight update: we calculate the Delta weight, which is the learning rate times the new value of Y times the input values on the various input variables. As you can see, this gives us 1 0 1 0, and if we add that to the old weights we get the new weights 2 1 2 1. As the second data item for the second iteration is identical to the first, the calculation looks very similar. We still get an output of 1 and we update the weights, and what we can see, and I think it is the only interesting observation on this slide, is that in the cases where we have a synchronous activation, both the presynaptic and the postsynaptic values equal to 1 in this case, we further strengthen the weight of that connection, which is consistent with the Hebbian theory. Now I want to say a few
words about associative memory. In psychology, associative memory is defined as the ability to learn and remember the relationship between unrelated items. This could include, for example, remembering the name of someone, or the aroma of a particular perfume, or any other sensory impression. Associative memory is a declarative memory structure and often episodically based. A normal associative memory task involves testing the recall of pairs of unrelated items, such as face-name pairs, but in the realm of human psychology associative memory is obviously a very wide term. When we turn to artificial intelligence and machine learning the term gets more precise. In these two areas, associative memory refers to a broad class of memory structures with mechanisms for storage and recall that can handle general patterns and pattern matching. Theoretically, all kinds of structures and data types should be able to be handled in the same system. There is also a clear coupling to the area of content-addressable memory, so-called CAM techniques, which is a more classical core area of computer science. For associative memory in the computer science setting, at one end of the spectrum there is memorization of specific objects or situations, and recall of these based on detailed but still partial or noisy descriptions. At the other end of the spectrum there is analogical reasoning, where structurally similar but domain-unrelated patterns can be recalled: domain-wise you can get something back that is from a different domain, but it is still in some structural way similar to your query, so to say. In the middle there is case-based reasoning, where separate patterns can trigger recall of larger patterns. Central concepts in associative memory are similarity measures, spatial or temporal; topological aspects of the pattern space, like valleys, hills and basins; and optimality-criterion aspects of the search space, like local minima, maxima, attractors, etc. Ideally one wants an associative memory system that has as many stable, well-separated local basins as it has memories to store.

We have two forms of associative memory. One is called auto-associative memory, which could also be called auto-association memory or auto-association networks. It is any type of memory that enables one to retrieve a more complete object description from a partial description. In more technical terms, the input and output vectors have exactly the same form, so Xi and Yi have the same form for the vectors X and Y. As you see in the example to the right, we have a partial description of a particular object where there are details missing or noisy, but what you can retrieve is the full description or
full picture of that object. Concrete examples are typically the restoration of imagery, like this one, or the restoration of speech fragments.

The other category of associative memory is called hetero-associative memory. Here, on the other hand, one can retrieve not only object descriptions of the same form, but potentially also a wider range of patterns still satisfying some measure of similarity with respect to a partial description. In terms of vectors, the input vector X and the output vector Y can have very different forms. As an example related to the above, we have as input a key, and what we can retrieve is the situation around the application of a key, which is actually part of a door and a lock where the key is applied. Let me say a few words about some key concepts here. I will talk mainly about three things: something called attractors, something called basins and something called bifurcations. An attractor is a state toward which other states in the region evolve in time. Similarly, each attractor has a basin, which is a surrounding region in state space such that all trajectories starting in that region end up in the attractor sooner or later; you can see that illustrated to the right in three different kinds of images. The basins belonging to different attractors are separated by a narrow boundary, which can have a very irregular shape; the appearance of such a boundary separator is called a bifurcation. For initial positions close to the boundary, small fluctuations can push the system either into the one or into the other basin, and therefore finally into either the one or the other attractor. So close to the boundary the system behaves chaotically, but inside the basin it moves predictably towards its attractor. Ideally, as has already been said, one wants an associative memory system that has as many well-separated basins as one has memories to store. So this was the end of this lecture; thanks for your attention. The next lecture, 6.7, will be on the topic of Hopfield networks and Boltzmann machines. Thank you, goodbye.


Welcome to lecture seven of the sixth week of the course in machine learning. In this lecture we will talk about some examples of associative memory: Hopfield networks, Boltzmann machines and a few other approaches. As you can see from the map of material this week, we are now here; after this we will turn to something called convolutional neural networks before we finish the week. The Hopfield network is one approach to the realization of associative memory. It is an instance of the subcategory of associative memory called auto-associative memory, which, if you remember, is a situation where you look at certain objects: you have a partial description of the kind of object you are interested in, and, as in a query fashion, you feed in this partial description, and the system is able to provide you, through recall, with the full description. So in a way this kind of system primarily completes partial descriptions; it is not likely to give you anything more, anything wider, anything outside the actual domain. The Hopfield network is also considered to be a recurrent neural network, even if it is not able to handle temporal sequences, in the sense that in every iteration, in every training step for a Hopfield network, you statically define the inputs to the network based on a particular data item. In an RNN you are able to have a sequence of items that constitute one single data item and to handle that in sequence; this is not the case with a Hopfield network. However, a Hopfield network has states and it also has cycles in the network, so because of that it is counted as a recurrent neural network. Still, it is more natural, I would say, to primarily classify Hopfield networks within the class of associative memories. It is a one-layer neural network in the sense that all units are input/output, so there are no hidden units, and there is actually no distinction between input and output: you input values and then you calculate new values, which become the output. It is fully connected, which means that all nodes, all neurons, are connected with each other, and the weights are symmetric, so going from A to B has one weight and going from B to A has the same weight. The units are modeled inspired by the McCulloch and Pitts neural model, which is very simple, with output values minus one and plus one. There are two modes: either you update the states synchronously, at the same time, or asynchronously in some sequence. The weight updating is inspired by Hebbian learning, so it is clearly in the tradition of associative networks. It also has an energy concept, which is very important and which ensures the convergence towards a stationary state; when you search for a stationary state you primarily use the value of the energy variable as your
guideline. The model enables fixpoint stable attractors, which correspond to the local minima. Hopfield networks were invented by Hopfield in 1982.

Let us now look at the updating of a unit in a Hopfield network. In a Hopfield network we first of all have what one could call one-shot learning: we decide how many data items we will consider, say N of those, which, in memory terminology, corresponds to how many memories we want to store in our associative memory. Having decided that, we construct a matrix X with these data item vectors as rows, and then the weight matrix is the transpose of that matrix multiplied by the matrix itself, divided by N; one can say it is the normalized value of this matrix multiplication. That is the storage part. Then we look at a particular test case, and based on the test case we calculate a new value for each of the states, which is done according to the formula: the new value of the state is plus one if the sum of all the input states from the other nodes in the network, times the weights on the connections from those nodes, is bigger than the threshold; otherwise we get a minus one. So the output is in the traditional McCulloch and Pitts style. The values of neurons i and j will converge if the weight between them is positive; similarly, they will diverge if the weight is negative, so in that sense we can say that this way of doing it is in the Hebbian tradition. Finally, another aspect of this is that when you do the updates you can do them in two different ways. You can do it in an asynchronous way, where you update one unit at a time, picking units at random or in a specific order. And then there is the synchronous mode, where all units are updated at the same time, which somehow requires a central clock in the system. Some people think this is less realistic, because in nature, and in the corresponding systems in humans and animals, there is actually no global clock of that kind; but since we are building an artificial system, maybe that is not a strong criticism.
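A minimal sketch of this one-shot storage and the asynchronous update (states in {-1, +1}, threshold 0; the names and the fixed number of update steps are our own choices):

import numpy as np

def hopfield_weights(X):
    # One-shot storage: W = X^T X / N for the N stored patterns (rows of X).
    W = (X.T @ X) / X.shape[0]
    np.fill_diagonal(W, 0.0)  # no direct self-cycles in the Hopfield model
    return W

def update_async(W, s, threshold=0.0, steps=100):
    # Asynchronous mode: update one randomly picked unit at a time.
    rng = np.random.default_rng(0)
    for _ in range(steps):
        i = rng.integers(len(s))
        s[i] = 1.0 if W[i] @ s > threshold else -1.0
    return s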

Hopfield networks have one very important feature, and that is the energy value. This is a scalar value associated with each state of the network, defined on this slide: it is the sum over all pairs of states of the product of the two states and the weight between them, divided by minus 2, plus the sum of the states multiplied by their thresholds. I will not prove the derivation of this formula; I will simply say that the point of this quantity called energy is that it is guaranteed to decrease over time through the updates. So
repeated updating will eventually converge to a state which is a local minimum of the energy function, and if a state is a local minimum of the energy function, it is a stable state for the network. So what you can do in practice is build a Hopfield network system, store a number of cases defining the network, and then, for every test case you test, look at the energy value and see how it behaves. Of course the idea is that you want to minimize the energy value: the lower the energy value, the closer you are to a stable state.
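Written out from the verbal definition above, with states s_i, weights w_ij and thresholds \theta_i, the energy is

E = -\frac{1}{2} \sum_i \sum_j w_{ij} s_i s_j + \sum_i \theta_i s_i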

Let us now quickly look at a few graphical examples. First you see a two-neuron Hopfield network characterized by two stable states; you can see graphical descriptions of the basins and the attractors for those stable states. If we move to three neurons you can also see a graphical depiction, and in this particular case, given the details, it turns out that the example has two stable states. Finally, if we move upwards it is not so easy to depict: you can have a five-neuron Hopfield network with a more complex weight matrix, but of course you can still imagine a landscape defined by the energy function. The more complex the network is, the more valuable the energy function becomes, because it is a simple scalar measure calculated from the system: even if the system is too complex to visualize, the energy value will always tell you something.

Let us now look at a more practical example. We have a small board with four items, or places, on the board; we call them one, two, three, four, and for each position there is either minus 1 or plus 1. The orange or yellow ones here are positive and the red one is negative, and if all positions on the board are in one of these states we can say it is stable, because those are the states that can be generated by the updating function. Of course we can introduce other states externally, and these are depicted as blue here; you can see down here that these items are blue, which means that they are neither minus 1 nor plus 1. In this example you can depict the graph in various ways, but it is very important to understand what the weights are, and as has been said earlier the weight setting is done in one shot. You look at the three patterns we have, and, in the Hebbian tradition, the weight setting is based on the average correlations across the three patterns. So if you look at p1, in the column below you can see that if you compare state 1 with state 2 it is the same, so it is positive, so you get a 1; if you compare it with
element 3 it is also positive, but if you compare it with element 4 it is the opposite, so there it is negative. This kind of procedure you can go through for all these combinations for the first pattern, and then you do the same for the second pattern and the same for the third pattern; then for each weight you get a row, you take the average of the figures for the various patterns, and you get a column with averages, and these averages are the initial weights we will use. One can also express this, as we will see soon, as a weight matrix, and the reason that works is that by definition in a Hopfield network everything is connected and everything is symmetric. This means that the matrix is mirrored in the diagonal: the figures we see here in this column will constitute one of the halves, and the other half of the matrix will be the same. What is in the diagonal is not really interesting, because it is the relation from a node to itself, and according to the Hopfield model there are no direct self-cycles, only indirect cycles. This means that we can construct the matrix as I described and then take a test case: we multiply this matrix with the test vector, but the result, according to the formula, is not just the multiplication of the matrix with the test case, because it also has to go through a threshold function, and in this case we have made it simple and set the threshold to 0. Doing that threshold operation we end up with new values of one or minus one, which can then be depicted, as you see below, as the stable state consisting of three positives and one negative in the bottom right corner.

If we repeat the same calculation for the same example using the matrix notation we originally introduced, you can see that if we form a matrix with the input vectors we considered as rows, and transpose that matrix, then we get our weight matrix by multiplying the transposed matrix by the matrix itself and dividing by N. You can see here that we form the matrix and the transposed matrix, we multiply them and we divide by N, and we end up with the matrix to the right. Then we can use that matrix: when we want to calculate an updated state, we simply multiply the new test case with that matrix to get the result. By that we have a more schematic way to easily calculate new states based on the weight matrix.
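Using the sketch from earlier in this lecture, the whole recall step could look as follows (the stored patterns here are hypothetical placeholders, not the ones on the slide):

X = np.array([[ 1,  1,  1, -1],
              [ 1, -1,  1,  1],
              [-1,  1,  1,  1]], dtype=float)  # hypothetical stored patterns
W = hopfield_weights(X)                        # one-shot weight matrix
probe = np.array([1.0, 1.0, 1.0, 1.0])         # a partial/noisy test case
recalled = update_async(W, probe.copy())       # settles towards a stable state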

Finally, you can see here the alternative way of updating. In the case we treated in more detail, all the units were updated in parallel, the synchronous case, provided there is some mechanism of synchronized time. Here you can see an example of the asynchronous case, where we take it stepwise: first we update one unit, then we look at the result obtained, then update another, and so on, and eventually that also leads, at the bottom right, to a stable state. With this we take a little break and come back in part two, in another video. Thank you.


Welcome back to the lecture on Hopfield networks and Boltzmann machines, part 2. We now want to leave Hopfield networks as such and look at some related approaches.

Boltzmann machines are also recurrent neural networks, and they can be seen as the stochastic, generative counterparts of Hopfield networks. The non-determinism allows them to come out of local minima; I would say that is the main feature here. Boltzmann machines were invented in 1985 by Geoffrey Hinton and Terry Sejnowski, and they are also called stochastic Hopfield networks with hidden units. As you remember, all units in a Hopfield network were both input and output nodes, so one could say Hopfield is a one-layer neural network, while Boltzmann machines may have hidden units. But as for Hopfield networks, we have symmetric connections between units. Finding optimal states involves relaxation, and now when we talk about optimal we do not necessarily talk about local optima, as for many of the hill climbing or steepest descent approaches that we looked at during the course; here we talk about a method that has some probability of getting out of local optima to find the global one. So in the Boltzmann machine setup we have introduced a kind of relaxation procedure where the system, so to say, can jump out of or potentially above a local minimum, and this technique is called annealing, simulated annealing. Boltzmann machines are named after the Boltzmann distribution in statistical mechanics, which is used in their sampling function; they are also called energy-based models.

The state changes can occur either deterministically, if the Delta energy value is less than zero, or stochastically, with a certain probability, and that probability is normally defined as one divided by one plus e raised to minus Delta energy divided by a parameter T, where T is the so-called temperature of the model. One important idea in annealing, and we will come back to that, is that you start with a high temperature and decrease the temperature during the process, which of course will affect the probability of making state moves.
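Written out, the acceptance probability just described is

P = \frac{1}{1 + e^{-\Delta E / T}}

where T is the temperature and \Delta E the energy change associated with the move, in the lecture's sign convention.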

Looking at the process in a Boltzmann machine, on this slide there is a simple scheme. We look at one example at a time, and for every sample we run the network in two phases. The positive phase is defined as follows: you clamp, or lock, the visible units with the pattern specified in the example, then let the network settle using the annealing technique, and then you record the co-activation of the units. You can iterate that a number of times, and then you do it
again, but then you leave all the units free, and you follow the same procedure: you do the annealing, etc. Then you compare the outcomes of the two different phases: one computes the probability that the two units in each pair of units are co-active in the positive and in the negative phase, and this gives us a set of probabilities. When we then update the weights, we base it on these statistics: we update them with respect to the differences between the probabilities calculated from the two phases; k is just a learning rate, which, as you have seen many times now, is just the damping factor. So this is the procedure; we will next go on and talk a little about the simulated annealing process.
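A minimal sketch of that final weight update (here k is the learning rate and p_plus, p_minus the co-activation probabilities recorded in the clamped and free phases; the names are our own):

def boltzmann_weight_update(w, p_plus, p_minus, k=0.1):
    # Strengthen w_ij when units i and j co-fire more often with the data
    # clamped (positive phase) than when running free (negative phase).
    return w + k * (p_plus - p_minus)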

The source of inspiration for simulated annealing is annealing as it happens in metals: you initially heat the solid metal to a very high temperature, which makes it possible for the atoms to move around relatively freely, but then you slowly cool the metal down according to some schedule, and if you start very high and cool down very slowly, you can achieve that the atoms place themselves in a pattern corresponding to the global energy minimum of a perfect crystal. Of course you can see the analogy: what we want is an optimized solution to our problem, which, depending on how multimodal our problem is, means that we want to find a global minimum. That is the analogy we strive for. One way of describing the steps in a simulated annealing algorithm would be the following. First you initialize: you start with a random initial configuration and you initialize the temperature parameter to a very high value. Then you propose a move, or change of the configuration, and you calculate some score; in our case it is the Delta energy, if we keep the property inherited from the Hopfield network. We calculate that Delta due to the proposed move, and then, depending on the Delta, the move is accepted or not according to the earlier given formula. Of course, as you can understand, at each step in the iteration we have a certain temperature: in the beginning we have a high temperature, which gives a higher probability of making changes, but for the next step we update the temperature, lowering it at a certain pace, and as we lower the temperature for each iteration, the likelihood that a move will be taken diminishes. We continue this until we come to what could be called a freezing point, which of course needs to be defined for each domain where we apply this. This slide just shows the phenomenon that, in contrast to a greedy algorithm, there is still the possibility to take moves even if you reach a local minimum: you can still take tentative
moves that may bring you out of the minimum, and hopefully in the end towards the global minimum, which you can see at the bottom.
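A minimal sketch of this loop (the cooling schedule, the freezing point and all names are our own arbitrary choices; energy and propose are assumed to come from the problem at hand):

import math
import random

def simulated_annealing(state, energy, propose, t0=10.0, cooling=0.95, t_freeze=0.01):
    t = t0
    while t > t_freeze:                      # stop at the freezing point
        candidate = propose(state)
        delta_e = energy(candidate) - energy(state)
        # Accept deterministically if the energy decreases, otherwise with the
        # earlier probability (delta_e here is the energy increase of the move).
        if delta_e < 0 or random.random() < 1.0 / (1.0 + math.exp(delta_e / t)):
            state = candidate
        t *= cooling                         # lower the temperature at a certain pace
    return state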
restrictive Boltzmann machine in practice is the connectivity, an unlimited connectivity of this
network. So therefore there are a number of approaches where one in a way different ways limits
the connectivity. So one such architecture is called restricted Boltzmann machine RBM and
where you decide that the neurons as a system must form a bipartite graph. So bipartite graph is a
graph where there cannot be any connections between the neurons or nodes within one of the one
of these parts, so the graph is divided into part and there is a group of neurons in each and
between neurons in one group there cannot be any connection, so it can be only be connection
between yours that are in two groups. So and actually the two groups are typically the visible the
visible neurons and units or the hidden units. This method is rather old but actually it get more
popular just ten years ago when the researchers together with Hinton developed some better
algorithms for this method and after that is come into more might use. Another related approach
is called bi-directional associative memory, abbreviated BAM, and the interesting difference here is that while Hopfield networks are auto-associative, which means, if you remember, that in auto-associative memory you can intrinsically only retrieve objects or patterns that are of more or less exactly the same form (you enter a partial description and you get back the full, format-wise same kind of object), BAM is an instance of hetero-associative memory, where you can actually retrieve totally different patterns. The reason this is possible in BAM is that BAM works with two separate layers: in one layer you can have items of one format, and in the other layer you can have items of a totally different format. The two layers are fully connected, which means there is an unlimited way of connecting the two classes of patterns. By having these layers, with a free format in both, you can live up to the demands of what is termed hetero-associative memory. We do not have much time to go into the details of BAM, but on this slide you can find a simple example that illustrates how things work. What you do is store associations between pairs of patterns in this memory, and you can also observe, up to the left, that the two patterns we associate can have totally different forms. What is possible to do is to create a matrix from the sum of the products of these associated vectors, and then that matrix can be used for retrieval: when you want to retrieve a pattern, you can, in this case, multiply the matrix with the second part of a pair, the B vector, and get the other part back, and if you go in
the opposite direction you can use the transpose of this matrix. There are no details here, sorry for that, but it is just to give you a feeling for how this retrieval can be done, and for the fact that you in this case can have associations between totally different structures. Another interesting variant of associative memory is what is called the Kohonen network, or self-organizing map, SOM. SOM implements a form of unsupervised learning: essentially it takes complex nonlinear statistical relationships between high-dimensional data items and maps them onto a low-dimensional display, so one of the results here is dimensionality reduction. This compresses information while preserving the most important topological and metric relationships of the primary data items on the display; it may also be thought of as producing some kind of abstraction. It was invented in the 1980s by Teuvo Kohonen. The map is predefined and usually consists of a finite two-dimensional region where nodes are arranged in a regular grid, and you get the idea from the figure to the right. You can see the input vectors below, and one can say that this model is fully connected in the sense that all inputs are connected, not among themselves, but to all the elements in the two-dimensional grid.

In the SOM network, if one looks at the nodes in this flat space of nodes, apart from being connected to every input, each node keeps a weight vector with one weight for each connection. That is built up during the training phase, but in the testing phase what the system does is, for each test case, try to find the node whose weight vector gives the smallest metrical distance to the test case.
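A minimal sketch of that test-phase lookup (the grid layout and names are our own):

import numpy as np

def best_matching_unit(grid_weights, x):
    # grid_weights has shape (rows, cols, d): one weight vector per grid node.
    d = np.linalg.norm(grid_weights - x, axis=2)    # distance to each node
    return np.unravel_index(np.argmin(d), d.shape)  # grid position of the winner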

So with this we finish the lecture on Hopfield networks and related work. Thank you very much for your attention; the next lecture, 6.8, will be on the topic of convolutional networks.


Welcome back to the lecture on convolutional networks. We will start by looking at how a typical convolutional neural network architecture looks, and you see it here. You have the input, which is the pixel representation of an image, and after that follows a number of layers defined here as convolution and pooling. It is not pre-specified how many such levels you should have; it could be one pair of convolution and pooling, or several such pairs, so that is up to the design of the network, and essentially it depends a little on how complex the image is: how many objects, how many features. If you want to model many features, and there are many levels of abstraction to capture, we need more levels. We will go into the details shortly; I am just giving you the broad picture now. One can say that this first phase is supposed to do the feature learning, so the idea is that after these phases the system has captured the relevant features that you can see in the image. The final step is then, given that the features are articulated, to analyze them. One typical task is to classify the potential objects that can be there, given the feature map, and that part is more or less like the traditional neural networks that we started to look at in the beginning of the week. The only thing we typically need to do first is the following: the output of the first phases is normally a multi-dimensional matrix, but as you may remember, for a traditional artificial neural network we normally want one feature vector. So typically there is a stage in between, called flatten, where we take the multi-dimensional matrix and create a one-dimensional vector, and then based on that vector we do the traditional classification. So this is more or less the architecture.

There are a few terms that are important here when we go deeper into these different phases. Convolution you have already heard about, and you have got the mathematical background. There is also a concept called filter, and a lot of people talk about kernels; personally I prefer filter, because kernel is used in so many fashions, as you remember from earlier lectures, which is a little confusing. There is a concept called stride and a concept called padding, which both have very much to do with how you handle the convolution layer. And there is the concept of feature map, which of course is the map of features that is gradually built up, and so on. We will go through all these concepts systematically in the following slides. We will start with what is most characteristic for a convolutional network: the convolution and pooling phases. These phases together one can characterize as the feature learning phase. The feature learning phase is a network consisting of an arbitrary number of pairs of convolution and pooling layers, and the number of such pairs of layers is an engineering decision for the typical problem setting, but in general the deeper levels handle more abstract or high-level features or patterns, in analogy with our assumed model of the functioning of the human visual cortex, which also has layers that systematically look at more and more complex features. Now we will focus on what happens in one layer, and we assume that the functionality will be the same independently of how many of these pairs of layers we add. For this kind of task it is always very important to look at what we start with, so here we assume we have images in a certain format, the RGB model; in this case we assume an example that is 32 times 32 pixels, with a depth of 3 to handle the color shades. Let us now look at the convolution layer.
A convolution layer typically has many filters that are going to be applied to the input image, in parallel or most typically in sequence. Now we will look at what happens when we apply one filter to the input image, and as you understand, what we do now is a slightly abstracted equivalent of the small examples of convolution we did earlier. Essentially, the way to regard what happens is that we have two functions: we have the input, the whole input image, which is one entity, and we have a filter, and we are going to view the input image from the perspective of that filter and map it onto another array. So you can say that the output from this phase is a new array, which should be considered the convolution of the input array and the filter. And it is the filter that reshapes the input, in the same way as the function g in the mathematical convolution reshapes f; here the filter reshapes the input and gives us the output. Of course, in every one of these steps one has to choose a particular filter, and there are a few things to think about. First of all the size: a filter is a sub-area, so we have a rectangular input area and a filter should be some subset of that, and normally we talk about filters of a limited size; it could be 5 times 5, it could be 3 times 3, and so on. The idea, in analogy with how we calculated the mathematical convolution, is that we let this filter slide in a systematic fashion across the whole input array. Let us disregard the color dimension for a moment; it of course has to be handled, but it should not disturb the main process right now, so let us forget it and assume that every pixel has one value, a number. When we slide the filter across the total input image, we take the dot product between each filter element and the corresponding element of the sub-area of the input array that we are at for the moment, and we get a scalar, one number, out of that. As you understand, when we move a filter around there are only so many positions the filter can be in; the size of the filter decides how many positions there can be, and normally those positions are fewer than the number of elements we started from. So when we have handled one position of the filter, we calculate the number and put it in the new output array; then we let the filter slide, do the same computation for that position, put it in the output, and so on. When we have done that systematically for all possible positions of the sliding, we have a complete feature output map. The normal thing is to have a stride that is low, which means that when we slide we take one, two, or three steps; with a stride of one we just move the filter window consecutively, but we could have a bigger step, which means that we move more rapidly, but there will then also be fewer filter measurements, which in turn lowers the number of elements in the feature map. So just to summarize: what we have is a filter, which is a kind of measurement from a certain perspective; we apply it and create an output array, and the size of the output array depends of course on the size of the filter and on the stride.
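As a minimal sketch of this sliding dot product, here is a plain Python/NumPy version of a single-channel convolution without padding and with a configurable stride; the function name and the NumPy formulation are my own for illustration, not from the lecture.

```python
import numpy as np

def conv2d(image, filt, stride=1):
    """Slide filt over image, taking the dot product at each position."""
    n, f = image.shape[0], filt.shape[0]      # assume square image and filter
    out_size = (n - f) // stride + 1          # fewer positions than input elements
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            patch = image[i*stride:i*stride+f, j*stride:j*stride+f]
            out[i, j] = np.sum(patch * filt)  # elementwise product, then sum
    return out
```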

Let's look at another example and go through it, because I think it is important to understand this basic step; it is the heart of the method. Here we have a seven times seven array, which gives us 49 elements. The filter is shown in the middle; the filter size is three, and it is marked in black. As before, we disregard the color channels, and the stride is one. A new thing, which we did not talk about on the last slide, is that there has to be a certain pattern in the filter, because the pattern in the filter is what defines the measurement we want to do. In this case it is a filter that is supposed to find diagonal patterns in the input image. Because of the size of the filter and the stride, in the same sense as in the earlier example, we get a smaller output array, actually a 5 times 5 array with 25 elements. And here you can see, and you can do it yourselves: you place the filter in a position, you make a pairwise multiplication, and those products add up to the figure four, so therefore you get a four in the output matrix. This is what is going on, and of course in a realistic example the matrix of pixels would be large and there would be many, many steps, so it is always complex machinery; but I hope I have given you the picture now that the core mechanisms here are pretty straightforward and not difficult to understand.
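Using the conv2d sketch from above on invented data reproduces the counts of this example: a 7 by 7 input and a 3 by 3 diagonal filter with stride 1 give a 5 by 5 feature map. The concrete pixel values here are made up, not those on the slide.

```python
import numpy as np

image = (np.arange(49).reshape(7, 7) % 2).astype(float)  # made-up 7x7 input
diagonal_filter = np.eye(3)            # 3x3 filter that responds to diagonals
feature_map = conv2d(image, diagonal_filter, stride=1)   # conv2d defined above
print(feature_map.shape)               # (5, 5), i.e. 25 elements
```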

There is another concept that you hit upon when you read about these kinds of systems, and that is the concept of padding, so you may wonder what that means. The principle is very simple: you have an image input array, and depending on how you choose your filter size and stride length, it can be that the sliding has difficulties covering all parts of the image in an appropriate way, because it may be that you cannot make slides such that every element is taken into account. One way is then to extend the image frame with some dummy elements on the border. Of course these are not informative, because they are just put there to get some space, but if you do that you may counteract the state of affairs where you cannot really measure every corner of the image. It is an engineering decision from case to case whether you think this is important, necessary, beneficial, and so on; there is no clear rule, it is a context-dependent choice. That is the concept of padding: augmenting the image frame with an extra frame, for the reasons I gave you.
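The interplay of filter size, stride, and padding can be summarized in a one-line formula for the side length of the feature map; this formula is standard for convolution layers, although the lecture does not state it explicitly.

```python
def output_size(n, f, stride=1, padding=0):
    """Side length of the feature map for an n x n input and f x f filter."""
    return (n - f + 2 * padding) // stride + 1

print(output_size(7, 3))     # 5, as in the diagonal-filter example above
print(output_size(32, 5))    # 28, as in the character-recognition example later
```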

As I already told you, in each convolution layer there may be many filters, which means there are many kinds of measurements you want to make in parallel on the input image, and these are handled separately. The same process runs for each filter, and each filter produces an output array. As you can see from the picture on the right, what we get when we have gone through the whole set of filters is like a sandwich of output arrays, so we actually get a bigger array in the end.

The pooling or subsampling layer is a complementary step. As I said earlier, the convolution step is normally followed by a pooling step, and the purpose of the pooling layer is to reduce the complexity of the array you created, because if you have many filters looking for many different kinds of features, that will give you a potentially large array coming out from the convolution step, and typically you want to reduce that. There is an analogy here with what we discussed in an earlier week about overfitting: when we talked about decision trees, we said that if a decision tree becomes too complicated, if you build in too many nodes and too many branches, it is more likely to overfit the data it is going to be tested on. It is the same here; with a too complex array structure there is also a risk of a similar phenomenon. The pooling layer operates on each feature map independently: as we said, out of the convolution step comes a portfolio or sandwich of feature maps, and in the pooling layer you do the pooling operation on each feature map separately. Also here you use filters, but for another purpose. You typically define a filter with a certain size and move that filter around on the surface of the array at hand, and then of course you need to define what kind of matching should take place between the filter and the sub-region of the array you are going to study. One very much used approach is to just pick the biggest value: you apply the filter window, and for the sub-area you cover you pick the element with the largest value, and that value is what you take out as the result of that measurement. In the same way as with convolution, you move this filter around with a certain stride. As you can see in the example here to the right, you have a four times four window, a filter of size two in both dimensions, and a stride of two, which means that there are only four positions this filter can have. So when we go around, we place the filter over each of these four positions, and if we use this max pooling approach we just pick the biggest number in each position. An alternative approach is average pooling, where you take the average value in each little square and output that result instead. The output here, given this example, is a two times two matrix instead of four times four. So that's also pretty straightforward.
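A minimal max-pooling sketch in the same style as the convolution fragment above; the 2 by 2 window and stride 2 mirror the slide example, while the concrete input values are invented (average pooling would just replace max with mean).

```python
import numpy as np

def max_pool(fmap, size=2, stride=2):
    """Downsample a feature map by taking the maximum in each window."""
    out_size = (fmap.shape[0] - size) // stride + 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            window = fmap[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = window.max()   # window.mean() gives average pooling
    return out

fmap = np.array([[1., 3., 2., 4.],
                 [5., 6., 7., 8.],
                 [3., 2., 1., 0.],
                 [1., 2., 3., 4.]])
print(max_pool(fmap))                  # [[6. 8.] [3. 4.]]
```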
There are two aspects of the neuron structures used in these networks for the convolution and pooling layers: weight sharing and local connectivity, and both of these are actually attempts to fight complexity. For example, it is assumed that when you apply a filter, you should apply it in more or less the same fashion no matter where on the input array you apply it; the behavior should be the same. Therefore you normally put restrictions on the weights: you say that the weights of the neurons involved should be the same, no matter whether the filter is employed here or there; the neurons involved should keep the same weights in these cases. In a normal artificial neural network the weights can all be separate, with no restrictions, but in this case, at least for these subsets of weights for this subset of neurons, you can impose the restriction that they should be the same. The other thing, local connectivity: as you know, in a general ANN any neuron can be connected to any other neuron, but here, because we have this layer structure and certain neurons are applied in a certain filter, it is also normal to rather strictly restrict the connectivity among the neurons involved. These things together contribute to making the system more efficient.
When leaving the convolution and pooling layers, and before entering the fully connected layers, the output of the previous layers is flattened. By this is meant that the dimensions of the input array from the earlier phases are flattened out into one large dimension. You can see a small example to the right, but you can also take a bigger one: a 3D array with the shape 10 times 10 times 10 would, when flattened, become a 1D array with a thousand elements. This is pretty straightforward, and now we have something that can easily be put as input to a standard artificial neural network.
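In NumPy terms the flatten stage is a one-liner; the 10 by 10 by 10 shape is the one used in the example above.

```python
import numpy as np

volume = np.zeros((10, 10, 10))   # output of the convolution/pooling phases
vector = volume.reshape(-1)       # flattened out to one dimension
print(vector.shape)               # (1000,)
```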

After the flattening we move to the fully connected layers. In contrast to the earlier layers, which primarily did feature extraction, the fully connected layers are typically supposed to do classification or something of that kind. The fully connected layers take as input the flattened array, representing the activations of high-level features from the earlier layers, and typically output an N-dimensional vector, where N is the number of classes the program has to choose from; for example, if the task is digit classification, N would be ten, since there are 10 digits. The fully connected layers determine which features best correlate with a particular class. You can have different kinds of activation functions here: you can have the ones we already discussed, but there is also something called the softmax activation function, which is particularly useful for handling multi-class classification problems. Essentially, what that kind of activation function ensures is that the numbers in the output vector sum to 1, so it gives you a probability for whether the instance you look at, the instance found in the image, is a member of each of these classes, and that can be pretty handy.
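Here is a hedged sketch of the softmax function just described; subtracting the maximum before exponentiating is a common numerical-stability trick of my own choosing, not something from the lecture.

```python
import numpy as np

def softmax(z):
    """Turn raw class scores into probabilities that sum to 1."""
    e = np.exp(z - np.max(z))        # shift for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # invented scores for three classes
print(softmax(scores))               # approx. [0.66 0.24 0.10], sums to 1
```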

Finally, a few comments on the activation functions that can be used. Many times you use different kinds of activation functions in different parts of this kind of complex system. If we look at the state of the art, it is very common to use the rectified linear unit (ReLU) activation function in the convolution and pooling layers; it is a very straightforward function and easy to handle, and there are no real drawbacks to doing that. For the output you can do different things: you can use the already mentioned sigmoid or hyperbolic tangent functions if you have a binary classification problem. For multi-class classification problems the softmax function is very popular, because you get a normalized output in terms of probabilities for class membership. Sometimes people also use Gaussian functions, but that is not the most common, and you can also have an identity activation function, which is recommended for regression problems.
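For reference, minimal one-line definitions of the other activation functions named here (softmax was sketched above); the shapes of the functions are standard, while their pairing with particular layers remains the engineering choice discussed in the text.

```python
import numpy as np

relu     = lambda z: np.maximum(0.0, z)        # rectified linear unit
sigmoid  = lambda z: 1.0 / (1.0 + np.exp(-z))  # squashes scores into (0, 1)
tanh     = np.tanh                             # hyperbolic tangent, into (-1, 1)
identity = lambda z: z                         # recommended for regression
```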

Here we will go through another example, a pretty well-known example of a CNN, actually one of the first really systematically carried out systems of this kind. It is from 1990, so it is pretty old given the youth of this sub-area, and it is a system whose purpose is to analyze handwritten and machine-printed characters; they are in image form, represented as pixels, so it is still an image recognition problem. As you can see here, and this is more of a repetition, there are two of these pairs of convolution and subsampling layers, and then there is a fully connected part: one flattening layer, two fully connected layers, and finally a softmax classifier is applied.

Here you can also see clearly how the dimensions vary; this is of course a key issue, to understand how the sizes of these arrays develop and thereby how complex a problem you create. You start from a 32 times 32 grayscale image, and then you have a convolution layer with a filter of size 5 and a stride of 1, so out of that you get a matrix of 28 times 28 times 6. Then you apply a pooling layer that reduces those dimensions by a factor of two. After that you have a second convolution and a second pooling, so in the end you are down to something that is 5 times 5 times 16. Then you have two fully connected layers and an output layer, where there is a mapping onto the digits 0 to 9.
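The shape progression quoted here can be checked with the output_size helper from earlier, under the assumption, consistent with the figures, that each pooling layer uses size 2 and stride 2.

```python
s = output_size(32, 5)      # first convolution, filter 5, stride 1 -> 28 (28x28x6)
s = output_size(s, 2, 2)    # first pooling, size 2, stride 2       -> 14 (14x14x6)
s = output_size(s, 5)       # second convolution, filter 5          -> 10 (10x10x16)
s = output_size(s, 2, 2)    # second pooling                        -> 5  (5x5x16)
print(s * s * 16)           # 400 values enter the fully connected layers
```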

As a summary of the system, you can see the different layers employed, the kinds of feature maps produced and their sizes, the filters applied (kernel is an equivalent synonym for filter), the strides, and the kinds of activation functions. Hyperbolic tangent is a variant of the sigmoid; obviously, in this very early system it was not yet popular to use the rectified linear unit function, so it was still the fashion to use sigmoid-related functions. So this was another example of the same scheme. Finally, I will give you a timeline of what has happened in this area, which as I said is a pretty young area. One can say that LeNet, the example I just showed you, is more or less the point where this really took off. There were a number of precursors in the period before, but this area really got famous only at the point where more powerful hardware became available to support these systems, because these systems are very complex and take a lot of computing power; the existence of, and the knowledge to apply, special hardware for running these systems made a big change. I would say that the big success story is the 2012 system AlexNet by Alex Krizhevsky. So that was convolutional neural networks; I hope you got a fair picture. It may not be the simplest area, but it is a very important component in this whole realm of artificial neural networks. Thank you for your attention; this is the end of this lecture. The next lecture, 6.9, will end this week, and it is about deep learning and some recent developments. Thank you.


Welcome to the ninth lecture of the sixth week of the machine learning course. This lecture will be about deep learning and its further developments. The question I want to pose here is the following. Deep learning is a very hot topic at the moment; if you open any paper, or look at TV and so on, there is a high probability that somebody mentions this term. So what is it, and if I try to tell you what it is, will it be a climax or an anticlimax for you? The answer, I would say, is that result-wise it is a climax, because what is normally referred to as deep learning today has produced more success stories than other parts of machine learning so far. However, for the purpose of this course, rhetorically it is an anticlimax, because this week we have already talked about almost everything that is of relevance for the concept of deep learning, in my opinion. So this final lecture on deep learning does not need much more time, because the constituents have already been covered. Even if I just told you that there is not much more to be said about deep learning than what we have already covered this week, I would like to give you a few facts about the timeline of artificial neural networks and deep learning. Some of these facts have already been mentioned, but I have now collected them on one slide to give you my picture of how things have developed. Let's start from the beginning: in 1959 Arthur Samuel coined the term machine learning, and since the beginning of artificial intelligence, machine learning has developed hand in hand with the other sub-areas of artificial intelligence; it has been a very integrated development. Also very early there was pioneering work on artificial neural networks, as you have heard this week, for example the perceptron. But artificial neural networks have always formed a kind of separate stream within artificial intelligence, so there has not always been a tight integration between the symbolic approaches and the artificial neural networks. This stream has been weaker or stronger depending on the time: it was very strong in the beginning, when people were very optimistic, but then, as you see from the timeline here, there was a long break from the early 1960s to 1986, in the order of 25 years, where very little happened, for various reasons.
But in the mid-1980s things happened. On one hand there was very important work by Rumelhart and Hinton, who reintroduced multi-layer feed-forward networks and the learning of those networks through backpropagation. Not much later, Rumelhart took the initiative to the precursors of recurrent neural networks, which then developed hand in hand with artificial neural networks in general, and also not much later, in the late 80s, came the introduction of convolutional neural networks, in particular for image processing. So in the latter half of the 80s, the key ingredients of what is now called deep learning were developed. A little ironically, around the same time a researcher in machine learning, Rina Dechter, introduced in one article the term deep learning for multi-layered symbolic machine learning, not in the context of artificial neural networks. At that point, people like Rumelhart rather introduced the term connectionist learning for these improvements of artificial neural networks, and published a well-known book with that title. It actually took considerable time, until around 2000, before the term deep learning started to be used for this combination of ANN, RNN and CNN, by one researcher, Igor Aizenberg, and his colleagues. So around 2000 this terminology was in a way adopted by the ANN community. Of course it is a good terminology, because if you have multiple-layer networks with many layers, it is natural to call them deep, and as you have understood this week, especially recurrent neural networks have the effect that the number of layers grows, which is of course also the case for convolutional neural network approaches. All these things tend to make the number of layers larger. But then there is another development that is very important, because in theory you may create very powerful systems by building systems with many layers, which was done through the combination of ANN, RNN and CNN, but those complex networks also demand a lot of computational power, and that computational power was essentially not there in the beginning. So in the first decade of the 2000s, people started to experiment with specialized hardware, GPUs, and we will come back to the use of these in the coming week; the application of that hardware made it possible to run these huge networks in an efficient way. This development made the combination of ANN, RNN and CNN, now termed deep learning, more useful. An example of the usefulness came in 2012, when one such system, a deep learning system in the sense of combining ANN, CNN and RNN, made a great breakthrough in the ImageNet challenge that I described earlier this week, with the system called AlexNet. That was one of the first real success stories, and since the term deep learning had been adopted in the decade before, with this breakthrough the use of the term spread, and today, in 2019, deep learning is the dominating term for the work on artificial neural networks in machine learning. So this is my story to you.

To convince you that the story I told you on the last slide is not just my invention, I will now point to a really good survey article, and I have given you the link; it is open access, so you can click on the link and find the article. It is a review article from Nature, volume 521, 2015, written by three of the key researchers:
Geoffrey Hinton, whom I would call the still active key person in artificial neural networks as such, and also more specifically in what is now called deep learning; Yoshua Bengio, who has continued Hinton's work and developed it further; and Yann LeCun, who is a key person in the development of convolutional neural networks. So what you find in this article is, I would say, the words from the horse's mouth: it is not what anybody else imagined or suggested, it is written by the key persons of this area. Of course, the article is not always totally objective, because these are the persons who coined the term deep learning, so of course they want to promote it, but they are very serious researchers and very instrumental in developing the field, so if you should listen to anybody, you should listen to them. What you find on this slide is a quotation from the summary of that article, where they say the following. They say that deep learning allows computational models that are composed of multiple processing layers, which is a rather trivial statement to start with, but with the important property that they learn representations of data with multiple levels of abstraction. This should be interpreted as saying that these systems are able to build up the representations themselves, not only use precomposed representations. They also say that these methods have dramatically improved the state of the art in speech recognition and visual object recognition, which are maybe the two most important test arenas; this does not mean that the technology cannot be used for other things, it is just that these two have been the focal points of the development of the area, but today there are other important applications, such as drug discovery and genomics. What deep learning does is that it discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters, which are used to compute the representation in each layer from the representation in the previous layer. This little paragraph in their summary really refers to the key machinery of feed-forward multiple-layer neural networks with learning through backpropagation, so this is a key point in what they consider deep learning. Then there is a final paragraph that articulates the two other components: deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, while recurrent nets have shone light on sequential data such as text and speech, which were the two other components in the lectures this week. As a summary, to the right you can see my own little graph of what constitutes deep learning: it is the combination of the basic machinery of feed-forward multiple-layer artificial neural networks (ANN) with backpropagation, with the developments in convolutional neural networks and recurrent neural networks. These are the triple of elements that constitute this area. Summarizing my attempt to characterize deep learning, I give you my preferred terminology. Computer science is a broad field, and it has a subfield called artificial intelligence; artificial intelligence in turn has a subfield called machine learning. As part of machine learning we have the combination of ANN, RNN and CNN that we took some effort to talk about this week, and this combination is what should be denoted deep learning. I have also included a slightly bigger framework: machine learning is today considered part of the umbrella concept data science, which also covers other areas of computer science outside artificial intelligence, like Big Data, and also areas outside computer science, most centrally statistics. So this is the big picture I want to convey to you.
Finally, a few words about the recent developments in this area of deep learning. One important stream of research and development going on at the moment continues what was highlighted in the summary of the deep learning article I referred to: the learning of structures and features in a representation, that is, the automated learning of structures and features. This is on top of the traditional categories of machine learning; supervised learning, unsupervised learning and feature learning can all be done without learning any new structures, because you can predefine the structures and features, but the ambition here is to learn the structures and features themselves. This seems to still be a very important endeavor in deep learning, though not necessarily in all other parts of machine learning. There is an inherent problem with ANNs in that they are at bottom sub-symbolic, which means that if you create a very complex artificial neural network system and let it learn over a long time, you build up a lot of structure, and so far it has not been trivial to understand what has been built up. You establish some knowledge in the system, and you can use the system for solving problems efficiently; however, it is not trivial to say what has actually been learned, what new knowledge has been created in the system that enables the better performance. Of course, people have realized that if we are to use these systems on a large scale, it is important not only to create efficient systems with better performance, but also to be able to extract, in symbolic form, what has been learned, in a way that we as humans can read and understand it. There is a lot of work at the moment on that particular issue; one of the new terms coming up is disentangled representations, which is not the whole story but part of these attempts.
Another thing going on: I hope you have noted that in my characterization of deep learning, associative memory did not play a big role. I talked about the combination of ANN, RNN and CNN as the key to deep learning; still, associative memory techniques are an important stream within neural networks as such, but traditionally they have not driven the development of deep learning. Today, however, there are a number of attempts to integrate these areas, so the integration of associative memory approaches with the mainstream ANN, RNN and CNN techniques is a key issue, and there are examples of that; the Deep Belief Network is one example. There is also some work to connect the ANN, RNN and CNN technology to the learning of Bayesian-network-related systems. There is also a distinction that we have not really talked much about: the distinction in machine learning and related fields between generative and discriminative classification. In discriminative classification we start from the purely inductive approach: we look at the examples and try to infer which abstractions or classes connect to the various examples we have. In a generative approach we instead start from our hypotheses or classes, and from those, partly through deduction, come up with probabilities for which kinds of examples would belong to the classes; of course this has to be validated by looking at the examples, but it is more of a top-down approach. At the moment there is, I would say, more focus on developing the generative approaches than the discriminative ones, which were the classical ones. We can also see an enhanced use of the ANN, RNN and CNN bundle of techniques for the purpose of reinforcement learning. I hope it has become clear that all work on artificial neural networks is orthogonal to these genres of machine learning, supervised, unsupervised and reinforcement learning; the techniques can be used for any of them, so it is not the case that ANNs are particularly tied to one of them, it depends on what you want, and there is now more work going on in using these techniques for reinforcement learning. Another trend at the moment is the scaling up of applications in time-critical and safety-critical settings like self-driving vehicles. Self-driving or self-flying vehicles and similar things are a very hot technological area at the moment; machine learning systems are needed in those contexts, but putting a machine learning system into such a context exposes you to issues of real-time criticality and safety criticality, and it is not evident that all the solutions of machine learning are ready to cope with those challenges. We can also see a shift towards being able to handle really Big Data.
Traditionally, machine learning started with small data sets; for a long time the field has been developing to manage larger data sets, but today we have huge data sets, with continuous generation of data from many sources, and if machine learning or deep learning is to be useful it also has to extend its ability to handle these larger data sets. Of course, there is also continuous work on utilizing special hardware; this is simply here to stay. A self-driving car has a special GPU on board that can be used for machine learning. Things are happening here, new hardware architectures are being tried, and this will remain a development area for a long time. Finally, there is also a consolidation of, and enhanced open access to, toolboxes and software support systems. For some time now the support for an engineer who wants to apply deep learning has been increasing a lot, but there are still improvements to be made, and this is also an active area. So this is my summary of what is going on in this area. By this I conclude this lecture, and I conclude this week on artificial neural networks. The only thing remaining is the last lecture, which as always is a tutorial on the assignments for this week. Thank you very much.

Welcome to this tenth and last lecture for week six, where we will talk about the assignments for the week and some recommendations for further reading. I have actually changed the setup for the assignments this week: I got the feedback that the distinction between questions on the video lectures and questions on the further material is not so easy, or not so meaningful, so I dropped it. I will now just go roughly through the kinds of questions and assignments I have designed for the video lectures; there will only be assignments for the video lectures. You will get some pointers for further reading, but they are mostly there because it may facilitate your learning to have something on top of what is in the videos; it is not required that you read all of it in order to answer the questions or do the assignments. I can also say at this point that I have thought about how to prepare you for the final exam. I have reallocated some material, so for roughly half of the last week of this course I will have a repetition session where I systematize and collect all the problem-oriented tasks that occurred over all the weeks and go through them with you, and all problem-solving tasks on the final exam will be within the realm of what is set up in this repetition. One can also say that you will only get problem-solving tasks based on what has been covered in the video lectures. I will also go through all the material recommended for all weeks, make a selection from that, and recommend a subset to you; it can then happen that you get recall questions based on those readings. This is my current setup for your final exam.
I will now briefly comment on the different questions and, as for a few of the earlier weeks, I try to characterize each question in terms of which lecture or subarea it relates to. There will be two questions on the first lecture on fundamentals; they are more or less of a recall character, so let us leave them. The next question relates to the lecture on perceptrons, and specifically to the weight-updating mechanism of the perceptron; as I hope you remember, this question follows very closely a couple of examples handled in the lecture. Question four is pretty similar: it is there to highlight the fact that the machinery for a single neuron in a general feed-forward network is not identical to that of the perceptron, where the machinery is simpler, so the outcome of a typical excitation of a neuron differs between the two cases; but also here your task is very much related to what was covered in the lecture. Similarly, question 5 relates to the learning mechanism of backpropagation in the lecture on feed-forward multi-layer networks, again concerning the adaptation of the weights. You may think these questions are pretty nitty-gritty, but the intention is to force you to look in detail at how weight updating takes place, which is the key to learning in this area.
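
To make the weight-updating theme concrete, here is a minimal sketch of the classical perceptron update rule (my own illustration; the AND data, learning rate and epoch count are arbitrary choices, and the exam questions follow the lecture's notation, not this code):

    import numpy as np

    # Perceptron training sketch: w <- w + eta * (target - output) * x.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    t = np.array([0, 0, 0, 1])  # targets for logical AND

    w = np.zeros(2)
    b = 0.0
    eta = 0.1  # learning rate

    for epoch in range(10):
        for x, target in zip(X, t):
            y = 1 if np.dot(w, x) + b > 0 else 0  # threshold activation
            w += eta * (target - y) * x           # adjust weights on error
            b += eta * (target - y)               # adjust bias on error

    print(w, b)  # weights and bias after training
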
After that we turn to recurrent neural networks. First you will get a question on the relation of this kind of network to some applications of the area, and a few more recall questions. You may remember that the vanishing gradient problem is an important problem for this kind of network, so it is a key issue how to handle it and, indirectly, which of the different approaches particularly target it. One version of recurrent network is the LSTM, which is a very widely used model in practice today, so I also want you to look at it a little more. It is not an easy area; I hope you remember it from the lecture, but I really want you to look more thoroughly at the detailed mechanism, so this question is there to trigger your interest in LSTMs. Then we leave that and go to associative memory. As in the other questions, the focus of the assignment here is on the updating of weights, but in a different fashion from what you find in feed-forward networks; there is a follow-up question on associative memory of a more general character, and in the end we come to Hopfield networks, where you will also find an exercise of the updating kind. Finally, we have a few questions on convolutional networks: the first on what convolution actually is, which is good to know if you are going to understand the area, a learning question relating to some ingredients of this method, and finally one question regarding the source of inspiration. So these are the questions for this week.
We now turn to the further readings, which are essentially a little mix of old and new. There are a few original references; some of the key works are by McCulloch, Pitts, Hebb, Hubel, Rosenblatt, etc. There are also some newer works giving an overview of current work in deep learning, but also specifically on long short-term memory and Hopfield networks. As I said earlier, these readings are there for your service: you do not actually need to read them to answer the assignments for the week, but they may help you, and a subset of them will also end up in the final collection of readings for the exam, where you may have to answer recall questions on that set. This is the end of this final lecture for week 6. Thanks for your attention; the next week of the course will have the following combined theme: on one hand, tools for implementation, and on the other hand, some comments on interdisciplinary inspiration for the machine learning field. Thank you.

Welcome to this seventh week of the course in machine learning. This week we have two separate themes: the first lecture is on tools and resources, and the second lecture is on interdisciplinary inspiration for the field of machine learning. So we start with this lecture and talk about a number of tools and resources that can be used for the implementation of machine learning systems. The lecture has a number of sub-themes, which I will briefly go through with you now. The first theme is about industrial contributions to the area. Roughly, one can say that from the late 1950s to 2010 machine learning was more or less a research area in artificial intelligence, and the bulk of results came from research groups. There was a modest industrial interest over time, but what happened in 2010 was more like an explosion of interest, and because this change of interest was so dramatic I want to talk specifically about it and about how the various companies have been engaged from that point onwards.
Having said that, we start from a natural and simple corner. Machine learning is about algorithms, as I hope you have understood during the six weeks we have spent together, algorithms working on representations. This is essentially the classical scenario in computer science: one talks about algorithms and data structures, and the next step is implementation, which typically happens through programming in a programming language. So it is very natural to start the discussion with this scenario: you want to build a machine learning system and you more or less program your system from scratch in a particular programming language. The next theme is about access to data and about some general resources for computing. If you want to do any kind of serious machine learning you need a good, large supply of the kind of data you want to work with, and you also need reasonable resources for computation, for the execution of your algorithms. Happily enough, for slightly longer than this explosive interest in machine learning, most of the big software companies have put a lot of effort into developing software support for what is now typically called cloud computing, so there are a lot of tools and support systems around, both for getting services from the so-called cloud, for getting help with development, and for doing computations. It is therefore appropriate to bring that into the picture fairly early. The next sub-theme, number four, is about what I call here software development enablers: libraries, tools of different kinds, APIs, etc. I will discuss that and go through a number of examples of such software tools existing at the moment. The fifth is about a category of systems called technical computing systems, of which Mathematica is an example; it is clearly possible to implement a machine learning system in terms of such tools as well. The sixth theme is about enhanced hardware support: more powerful computer architectures and more powerful processors dedicated to this kind of task. And the final theme for this week is about datasets and repositories of datasets, because that is also a crucial component if you want to do applied and serious work in this field.

Let's start with a short discussion about the industrial interest in the area, about companies with substantial research and development with this focus. As I said, development of techniques in this area has traditionally been driven by research groups at universities, with a few exceptions. Companies like IBM have actually been interested, at a moderate level, for a very long time; there may be a few other such examples, but not many. In the 1990s there was an increased interest in machine learning: the area of data mining was formed, and data mining was primarily very applied and driven by industry. However, the magnitude of the work in data mining at that time was, I would say, at a normal level, considering that machine learning was already a serious technical area. What has happened since 2010 is something different. It is a clear trend shift, an explosion of interest in this area, and not many years have passed since that time. What happened is that suddenly almost all influential software companies, but also more and more hardware-oriented companies, made large investments in building up competence in artificial intelligence and machine learning, also officially stating their belief that this competence, with subsequent effects on their products and services, would be pivotal for the further success and competitiveness of the company. So they were not doing this silently in the background, but rather very clearly and explicitly stating it as their main way forward. Apart from the large companies, one could observe during the same period an extensive set of successful startups coming up. Some of them grew into independent companies, but many of them, while still of moderate size, made profitable exits when acquired by the competing industrial giants who wanted to expand their territory.
So why did this happen? It is not easy to tell, but one can at least mention a few contributing factors. Just before this happened we could see some really clear success stories for the field, specifically related to improvements in performance in image processing and in speech processing. We could also see an awareness of the possibility of optimizing performance, because performance has historically been a huge problem for this area, specifically for the field of neural networks; essentially, one could say that neural networks have benefited most, as a subarea, from the extended presence and availability of better hardware support. There are also some areas that have been growing and developing during this period, are still developing, and for which the prognoses for further expansion are very good: robotics and autonomous vehicles. In a broader sense we can see the upcoming presence of cyber-physical systems, where digital systems are more and more tied together with physical systems, and, closely related to that, the concept of the Internet of Things. Because we have much more of these interwoven systems of both hardware and software, the impact of AI and machine learning algorithms is scaled up: an algorithm may have existed for 30 years, but 30 years ago the overall technological presence in various sectors of society was not so big, so the effect of applying the algorithm was smaller, while today the impact has grown. To sum up, the industrial interest of course had a big effect: it affects research and development, but it particularly also has a very strong effect on the availability of tools and resources, because if industry wants extensive development of this kind of system, industry needs tools and resources, and a lot of resources have to be put into that development.

Let us now look at some of the companies that have been engaged in this area since roughly 2012. At the top you have Google. Google has worked very hard to establish itself as a key player in this arena: since 2010 they have had a deep learning AI research team, Google Brain; in 2017 they announced a whole new division of the company dedicated solely to artificial intelligence; and finally, Alphabet, the holding company for Google, has bought and taken over smaller companies, one important example being DeepMind Technologies, also founded in 2010, which has shown immense success in various kinds of game-playing applications. So that is Google. Another company that was also early, though a little later than Google, is Amazon. Initially the focus was very much on improvements to Alexa, Amazon's language assistant system, which they also worked hard to integrate into their Echo speaker series, but soon they looked more and more to what they call the cloud platform of Amazon Web Services and put a lot of effort into augmenting that platform with machine learning components. Amazon obviously has a lot of power, and even if they started a little later they are an important player in this game. IBM has been active since the 1950s, so they are an exception: they have been active for a long time and are still active. They created a pretty famous system called IBM Watson, originally more of a question-answering, problem-solving system, but what they have done now is more and more to create software tools through which various kinds of users can access sub-functionalities of the original Watson system, and they have also seen to it that these kinds of functionalities are available in the IBM cloud services. We should not forget Microsoft, who have a well-known and very widely used cloud service called Azure, where Microsoft too have worked hard to add machine learning components. Microsoft have of course also been interested in having their own digital assistant system, called Cortana, and have tried to strengthen their position by purchasing a number of smaller AI companies, five in 2018 alone. Facebook is also engaged: they have an AI research group called FAIR, they have recruited top people, specifically from the neural network field, like Yann LeCun, whom you heard about in earlier lectures, and they develop and support software packages that compete with, for example, Google's. A European actor, SAP, is engaged as well; they also have a cloud platform which they have developed in the machine learning direction in the same fashion. Finally, Apple has obviously been very late to this, but it seems they are now also trying to get into the game. There are also hardware companies that are important, but I will take them up later in the lecture.

Let's start with the scenario where you want to develop a machine learning system from scratch in a general programming language, the classical scenario. Which programming language should you choose? Two languages have been used a lot in artificial intelligence for a long time, Lisp and Prolog, typical representatives of the functional style of programming and the logic programming style respectively. However, if we look at the situation today for industrial machine learning applications, these languages do not play a strong role. Actually, for real-life machine learning systems we can observe the dominance of languages that support the object-oriented paradigm. During the last twenty-five years, I would say, the triad of C (extended with C++ to become object-oriented), Java and finally Python has dominated the scene, however with their relative top positions changing over this period, as indicated on the slide: originally C and C++ were very dominant, over time Java became more dominant, and at this moment the most popular and most widely used general programming language is Python. There may be many reasons for that. Python is a really good language, so of course that can explain it; I think it is a very clear, simple and straightforward language, but there is also the fact that Python easily lends itself to programming in many paradigms. Probably, for machine learning, the future lies in multi-paradigm languages, because very much of the history, as I already told you, lies in functional programming and logic programming; so of course it is very nice to have a language that can be efficient like C, but that is not only imperative but also object-oriented, functional and logic-oriented. This gives some explanation of where things are moving. There are also other general languages coming up; one such language is R, which has a very strong specialization towards statistical computing. One can also say that for certain applications, where efficiency and real-time issues are still very important, the combination of C and C++ still holds a strong position. Let us now look through a list of languages of this kind. As you see, Python is currently at the top. Most of these languages are open source; this means that you can access the language, you may have a license, but it will not cost you anything, and many times the restrictions on what you can or cannot do are pretty liberal. In second place at the moment still comes Java, and in third place, as already mentioned, C/C++, which is also included if you look at the top group for popularity. Then there is a variety of other languages in this ballpark: JavaScript, which is also popular, Scala, used in some contexts, and other special languages like Julia coming up; but as you see at the bottom, Prolog and Lisp are still there, though not much used in practice.
As I already said, machine learning is not only about algorithms: very essential aspects are access to relevant data and access to the necessary computational resources. As a consequence, the natural basis for machine learning systems is the systems for distributed cloud computing provided, at this moment, by all the major companies in the field; the common terms here are cloud systems and cloud platforms. Essentially, what you can get out of the cloud as an end user is services: you can acquire services at a distance without having to worry about how these services are realized, so you only have an end-user experience. This is what people in this field abbreviate SaaS, software as a service. The next step, which is not so much for end users as for developers, is what people call PaaS, platform as a service: not only specific services are provided, but platforms for the development of new applications are provided via the cloud system. Finally, you have what is abbreviated IaaS, infrastructure as a service, where you can get access to more or less unlimited infrastructure, computational and data storage resources, via the cloud system. The reason for taking this up here, even though this is not machine learning but distributed computing, cloud computing, is that there is a very clear tendency, which I have already touched on a number of times in this lecture: almost all of these cloud platforms are now, and have been for the last few years, augmented by components that support the development of machine learning systems.

If we look at the various frequently used cloud computing platforms, the same big players occur. I guess everybody has heard about iCloud, but Microsoft also has a very ambitious cloud computing system called Azure, Amazon has Amazon Web Services in a similar way, and SAP has Leonardo; all these big software companies have their own solutions here. But there are also, as you can see below, some more clearly open-source-oriented systems: even if some of them are developed and promoted by these companies, there are also solutions that originally come from a purely open source environment. An example of that is the system related to the Spark software foundation, which is open source; but today new alliances are also being formed, so, for example, the Hadoop system often occurs in the context of collaborations with IBM. So this is an arena, and in a minute you will see more couplings coming up in the extensions to machine learning.

I started by talking about developing machine learning systems from scratch in a general programming language. Actually, today that is not the typical thing, and not only for machine learning: for any kind of software development, the typical thing is not to develop all code from scratch. Typically, you combine existing atoms or molecules of software code from libraries that are already programmed, tested and developed to perform some functionality, so the task of programming today is rather one of combining pre-existing modules. Of course some programming is still needed; I think it is rare that you can build something only from these pieces put together, but the bulk of the work and the bulk of the system functionality comes from the combined functionality of the atomic building blocks. Today there is therefore a whole range of what I call here software development enablers of various kinds, going under a lot of different names. A pretty straightforward one to start with is the software library: a software library consists of subroutines for important algorithms and functions for targeted application areas, and it can be narrow or broad. Libraries have always been important, have been around for a long time, and are still there. But today there are also other tools. People talk about integrated development environments, IDEs, which are tools that not only provide access to a library of subroutines but also include software components that automate some useful development processes, such as debugging, code generation, etc. On a slightly larger scale, people talk about software development kits, where essentially a number of these IDE facilities are combined, and larger still, people talk about frameworks. Then you also have the important concept of the application programming interface, API, where an API, as the name says, is the interface to these tools and libraries; one could say that an API is the logical representation of what is in all these toolboxes that you want to access. It is not a priority of this lecture to classify all the enablers that exist in the machine learning area according to these categories, because the borderlines are often blurred, and there may be many reasons why people call something what they do, so it is not really meaningful to try to draw sharp boundaries here, especially not in a state where very intensive development is going on in these areas. Unfortunately, one can say that because of all these concepts, and because so many things are developed and interrelated, it is not entirely trivial to understand this fast-growing forest, to use an analogy, of software tools. But in the next few slides I will mention a few of the enablers that are relevant if you want to build machine learning systems.

Let's look at a few of the software development enablers; there are various kinds. At the top of the list you can see a system called Anaconda, which is basically a distribution channel, a platform for distributing the Python and R programming languages in a convenient way for the users, also handling the program libraries that come with these languages in an efficient way. Then there is a category of support systems strongly focused on support for building artificial neural networks. One can say that many of these tools are support tools for doing efficient multi-dimensional array calculations, because as you have understood from earlier lectures, it does not matter whether you start from a support vector machine or an artificial neural network; many times you can transform those systems into computational problems expressed over multi-dimensional arrays. Anyway, TensorFlow is such a system: it is one of the primary tools pushed by Google and is probably one of the most popular tools at the moment. There is a competitor to TensorFlow called PyTorch, which is sponsored by Facebook; both are open source, so even though they have a very strong company coupling they are easily available, and both are very clearly coupled to programming in Python, primarily though not exclusively. Then we have a similar Microsoft system called the Cognitive Toolkit, which has mixed Python and C++ connections. There are other systems too. One issue here is that many of these systems are interlinked, which is probably a good idea because it is useful, but when you are starting to understand what is going on it is a problem, because a system like Keras is a system of its own, yet it can run on top of some of the others, like TensorFlow or the Microsoft Cognitive Toolkit. So these are the systems; there are similar ones, Amazon also has a toolkit of this kind, and as I already said, even IBM now tries more and more to offer at least a lot of the sub-functionality of the original Watson system in various contexts for their customers.
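
To give a feeling for what programming against such an enabler looks like, here is a minimal, hedged sketch using the Keras API as shipped with TensorFlow (the layer sizes and the random toy data are my own illustrative choices, not anything from the lecture):

    import numpy as np
    from tensorflow import keras

    # Illustrative toy data: 100 samples with 4 features, binary labels.
    X = np.random.rand(100, 4)
    y = np.random.randint(0, 2, size=100)

    # A small feed-forward network defined through the Keras API,
    # which runs on top of TensorFlow's multi-dimensional array engine.
    model = keras.Sequential([
        keras.layers.Input(shape=(4,)),
        keras.layers.Dense(8, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X, y, epochs=5, verbose=0)  # backpropagation is handled by the framework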

Here you find more of the systems. I want to focus mostly on the top two on this slide, because these systems are not only meant to provide efficient access to the whole range of machine learning algorithms; many times, when you run a machine learning project, it is very important to be able to coordinate a number of resources and a number of steps. One such support system is the Jupyter notebook, which essentially emanates from a similar, earlier project called IPython, so these two go together. Essentially, what this kind of support system tries to do is to help you coordinate work, possibly in different languages, because you do not necessarily have to work in just one language, and in a variety of toolboxes, and also to coordinate your data and help you present the results in an efficient way. So this is a very popular complementary support system in the machine learning context. The rest of the systems here are more of the same. For good or for bad, there are many powerful actors here who want their market shares, and all of these actors develop their own systems. In a way, competition is healthy because it sharpens and pushes the functionalities, but it is also, I would say, somewhat confusing, because it is not the case that one tool is best for one purpose and another tool for another; there is simply a bunch of tools from each of these actors, some more popular, some less, and this develops over time. So my advice here is simply that very much of what you choose depends on the context you are in. Even if you are an independent person, a student with no special connections, I think everybody has a context: you have your university, and it may be that your university, your department or the research group you work with have already made some choices, deciding that they want to work in this environment and not in that one, and if you want to make something useful it is always clever to use the competence of the context you are in. Even more so if you come to a company: it is very clear that most companies have made some choices here, which means that the actual choice you have as a person in this kind of setting is rather limited, because other people have made a number of choices for you.

I also want to introduce you a little to another realm of support systems, what I call here technical computing systems. Technical computing is the application of the mathematical and computational principles of scientific computing to solve practical problems of industrial interest; so it is not scientific computing, it is technical computing. This kind of system is a dedicated software system for supporting such computing: typically, it comprises implementations of a great variety of key algorithms in applied mathematics and some extensions of that, but also some general programming capabilities, so even these systems have some kind of programming language built in for doing the extra tailoring needed to glue the pieces together. Increasingly, technical computing systems now comprise direct, explicit implementations of key machine learning algorithms. Some time ago you could use this kind of system for implementing machine learning algorithms, but there were no pre-implemented such algorithms; you had to do it yourself in terms of the applied mathematical tools, but today more and more of the classical machine learning algorithms are built in. Essentially, there are three categories of systems. For a long time there have been numerical analysis software systems that really support numerical problem solving, but there is also a category of systems, whose development started in the 1960s, called symbol manipulation systems, which essentially solve mathematical problems symbolically, not numerically. I would say that the most widely used systems today are not extreme in either of these senses; the systems used today, at least for machine learning, are hybrid systems. In this genre it is more common that the systems are proprietary than open source, in the sense that you have to pay, not just a little but probably a reasonable sum, to use them, and there are of course also a lot of restrictions on what you can choose to do.

I would like to say a few words about some of the most widespread technical computing systems. I believe one of the most well-known and popular is still Mathematica; it is a proprietary system, provided by an organization called Wolfram Research, and is based on its own general programming language called the Wolfram Language. It is a very useful system, but a closed one. Maybe as a reaction to that, at least some open source alternatives have been developed; an example is the system at the bottom called SageMath, which is open source and was released in 2005. Otherwise, the competitors to Mathematica in the original genre of systems are, I would say, primarily Maple and MATLAB. MATLAB, in contrast to Mathematica, is primarily a numerical computing system, whereas Mathematica is a balanced hybrid; on the other side of the camp there are systems like Maxima, which really developed from the symbolic side, the computer algebra systems. One general comment here, which I have also made earlier, is that in this genre you are probably very much constrained in your choice by your environment: typically your university or your company decides whether you use one of these systems. Either you are in an organization that uses Mathematica, or one that uses Maple or MATLAB; you typically do not use all of these systems mixed together, so whether you are at a university or at a company, the choice you have is probably to use what the organization has already decided for you.

The next theme is about hardware. One very important factor in the recent successes of machine learning is better hardware support. What is mentioned a lot today is the existence of what are called AI chips or AI accelerators: machine learning algorithms, particularly the artificial neural network ones, demand more computational power than can be provided by conventional CPUs, so new computing architectures are needed, and the terms AI chip, AI accelerator or neural network processor are used, because a lot of the recent efforts have been on efficient computation for neural networks, not only but primarily. What I mention on this slide are four trends or streams of work. There is something called heterogeneous computing, which has been going on for a long time, where you design complex computer architectures combining various kinds of specialized processors with conventional ones, even embedding them on the same chip; this kind of multi-core, tailor-made processor structure has been one way of achieving better performance. What is talked about most, and you have probably heard about it, is the use of graphics processing units, GPUs. Essentially the story is that these processing units were developed primarily in the gaming industry, where you have a lot of visualization and graphics and need very high performance. This kind of architecture was developed in that field, but it turns out, and hopefully you have got a feeling for this, that the mathematics behind neural networks and image manipulation is pretty similar, so it was not a big step to realize that it would be a good idea not only to use but to adapt this kind of processor for machine learning tasks. GPUs are still probably the main trend in development here, but there are also other avenues: there is something called field programmable gate arrays, which is another architectural philosophy for processors, and then something called application-specific integrated circuits. One can say that the GPUs triggered the process and are still dominating, but computer architecture is a broad field and there are many ways forward; not everything happens on the graphics processing track, and work is also coming along the other streams I mentioned.
This brings us to the hardware companies themselves. What one can see is that hardware companies that traditionally were not really interested in artificial intelligence and machine learning, when they realized that this technical sector could be a really good avenue for increasing and strengthening their customer base, also engaged themselves to some extent, not only in the hardware part but also in the software aspects. You can see that companies like Nvidia, the key player on the GPU side, and companies like Intel, Qualcomm, etc. lift their eyes to the software aspects; conversely, companies not traditionally focused on specialized hardware, Apple, Samsung, Google, Microsoft, have started to interest themselves in these specialized processor architectures, of course for the reason that they do not want to be entirely in the hands of the other category of companies. So there is a jungle of systems coming up here: Nvidia is of course very well known and comes up all the time with new, more efficient versions and models, but you can also see on the list that Google themselves produce processing units, so obviously this has become a field where the software and hardware companies meet.
There is another aspect that is important for this meeting between hardware and software. When this development started it was very tricky to use these special kinds of processors: when you built a system you had to tailor your solution for the explicit use of the GPUs, which is possibly tricky. So there is now a category of programming languages and tools that you can call GPU languages, or GPGPU, which means general-purpose programming on GPUs, and there are a few well-known ones; I think the most well-known is called CUDA, which is very much coupled to Nvidia, but there are also others coming up, OpenCL, Harlan and so on. It turns out, of course, that if you have a tandem development of very good specialized processor systems and are also the driving force monitoring the development of a popular language of this kind, you have a competitive advantage, because your "language", in quotation marks, is always slightly biased towards the type of processors you push for. Therefore we can now see that all the hardware companies have understood this and also engage themselves in this kind of language, so that there is strong support for getting help on how to utilize the specialized processors in your system for the purposes you want.

So those were the repositories where you can assume that the data sets have a more guaranteed level of quality. However, there are now a lot of other open data sets, and of course there is no guarantee for every data set in these repositories that it is a trivial task to just use the data; in many cases you have to do some pre-processing of the data from these data sets. On this last slide you can see some more examples: the big software companies like Amazon, Google, Twitter, etc. collect data and also exhibit it in an open fashion, but there are also large organizations like the US government and the World Bank who make datasets available in an open fashion. So this is the end of this lecture; thanks for your attention. The next lecture, 7.2, the last one for this week, will be on interdisciplinary inspiration sources for machine learning. Thank you.

Welcome to the second lecture of the seventh week of the course in machine learning. This lecture is about the second theme for this week, called Interdisciplinary Inspiration. Essentially, this lecture does not present much new material; it merely collects together all the parts that have occurred during the course where other disciplines have inspired the work and development in machine learning in particular, but also in artificial intelligence in general. Let's start with what is most important: artificial intelligence and machine learning are very much dependent on mathematics and statistics. I hope this message has gone through; in this course I have tried to stay on a level where we do not burrow down into these areas all the time, but I have also said clearly and consistently that when you develop your knowledge in this area, when you go deeper, and if you continue with this area after this course, you will under all circumstances need to handle much more mathematics and much more statistics.
Let's start with mathematics. There is a variety of parts of mathematics that come into play. One important area covers vectors, matrices and linear mappings, with inner and outer products, measures of similarity in vector spaces, differentiation of matrices and vectors, the chain rule, which is an important element in certain corners of machine learning, inverses of matrices, least squares techniques, eigenvalues, tensors, etc. Geometrical aspects are also important: geometric interpretations of mappings, other things like hypercubes and discriminant functions, and also optimization techniques; furthermore graphs and digraphs, and I guess you have observed that graphs occur in various corners as well. In a similar fashion, statistics is a cornerstone for this area. Statistics has traditionally dealt with data collection, data modeling, data analysis and data presentation. Maybe we have not really highlighted data collection and data presentation so much in machine learning, but of course when you do a practical project they will eventually be there; everybody has to collect data, no matter whether you come from a background where you call yourself a statistician or one where you call yourself a machine learning scientist. The core theoretical parts of statistics are of course the fundamentals of mathematical statistics, based on probability theory; all the fundaments there are important. In general, machine learning is more dependent on inferential statistics, the statistics that wants to draw conclusions about whole populations from the study of samples, while so-called descriptive statistics primarily summarizes the samples themselves and does not draw further conclusions. Specific areas that have occurred during the course are Markov processes, a kind of stochastic process, Bayesian methods, Monte Carlo methods, and so on. So the influence of mathematics and statistics is not just an inspiration; I would call it a necessary foundation for the area.
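
As a small illustration of the descriptive versus inferential distinction (my own example, not from the lecture; the sample is synthetic and the library choices are arbitrary), the following sketch first summarizes a sample and then draws an inferential conclusion about the underlying population:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    sample = rng.normal(loc=10.0, scale=2.0, size=50)  # illustrative sample

    # Descriptive statistics: summarize the sample itself.
    mean, std = sample.mean(), sample.std(ddof=1)

    # Inferential statistics: a 95% confidence interval for the population
    # mean, a conclusion about the population drawn from the sample.
    sem = std / np.sqrt(len(sample))
    low, high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
    print(f"sample mean {mean:.2f}, 95% CI for population mean ({low:.2f}, {high:.2f})")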

Let's move on now to some other disciplines, first Theoretical Philosophy and Logic. Obviously, the classical modes of reasoning inherited from theoretical philosophy and logic have had a strong impact on artificial intelligence in general and machine learning in particular. These are the classical ones: Deduction, obviously very important in inductive logic programming; Induction, a foundation and general source of inspiration for most parts of machine learning; Abduction, as the basis for explanation-based learning and for backward reasoning in Bayesian networks; and Analogy, finally, as a basis for case-based reasoning and associative memories. Hopefully you have already seen these couplings during the course. The couplings between logic, theorem proving and certain parts of artificial intelligence are also very strong; specific examples that have been mentioned during the course are the following. Early programs in artificial intelligence, like the Logic Theorist program by Newell and Simon, proved theorems from Principia Mathematica. McCulloch and Pitts' analysis of the capability of neurons to perform logical operations was based on a specific logical notation used by the logician Carnap. Logic programming is based on a specific theorem-proving technique called SLD resolution. Finally, the LISP language is based on a kind of logical calculus called the Lambda Calculus. One can go on like this: for many techniques in these areas there is a source of inspiration in logic somewhere.

Let's turn now to Linguistics. You will probably find it more difficult to see the real coupling between what we have talked about in this course and linguistics, but I anyway want to say a few words about this discipline. Many aspects of language, including the learning of language, are central to artificial intelligence. The theoretical and structural view of language has its major proponent in Noam Chomsky, who published his classical work Syntactic Structures in 1957; as I remember, artificial intelligence was defined as a field in 1956. This very formal, theoretical view of linguistics went very well in hand with the developments in computer science and artificial intelligence during the same period. It was also well aligned with the movement away from behaviorism towards cognitivism in psychology, which we will come to in a minute. Finally, other linguists like George Lakoff widened the work of Chomsky, not focusing primarily on syntax but also on semantics, and also highlighting the importance of cultural differences and embodiment.
The various approaches in machine learning reflect a long ongoing debate in psychology, where the pendulum in the last century has swung between Reductionism and Cognitivism. Reductionism strives to reduce the explanation of all mental behaviour to neurophysiological low-level processes; you can easily relate this to the artificial neural network approaches in machine learning. Cognitivism, represented by psychologists like Miller, Broadbent, Cherry and Bruner, argues for the relevance and existence of an abstract model of cognition in terms of symbols, concepts and logically related inferences, and hopefully you can just as easily see the analogy to the symbolic machine learning approaches. These are clearly the sub-symbolic side and the symbolic side of the coin. If one looks at psychology during the last hundred years, the pendulum, as I said, has swung between these extremes, with various middle standpoints in between. Starting with Structuralism, which was a very strong movement at the end of the 19th century: in that approach one tried to define the simplest components of our mind and then put the pieces together to model more complex phenomena of the mind, but the approach was not very empirical, as the evidence was primarily based on introspection, on questions to people and their own reflections on their own thinking, and on self-reports. As a reaction to that came Functionalism, which looked more at functionality or behavior, trying to explain the processes of the mind in terms of the usefulness of the manifestation of behavior: less theoretical, more practical and more purposeful. That then developed into something more extreme, the well-known school of psychology called Behaviorism, where researchers like Watson, Skinner, Thorndike and Pavlov assumed that all behaviors are either reflexes produced by responses to certain stimuli, or a consequence of the individual's history, especially reinforcement and punishment in certain situations, together with the individual's current motivational state and controlling stimuli. The parallel to the perspectives in reinforcement learning is kind of obvious here. Finally, things changed again and psychology moved towards Cognitivism; Gestalt psychology is also a part of psychology, one that focuses more on the global phenomena, the holistic parts, and you will see, when we come to neuroscience, that neuroscience too has this very complex debate between researchers looking at atomic phenomena and those looking at more holistic phenomena.

Let's now look at an area which we have touched many times during the course: categorization, or if you want, classification or concept formation. I think it is fair to say that the inspiration for how we think about categorization comes not from one other discipline but from several: from philosophy, psychology and anthropology combined. What I want to stress here are two views of categorization. One view, which I call here the classical view of categories, has very old roots with Aristotle and the Greek philosophers, but is also exemplified by the work of later philosophers and psychologists like Bruner. The main ideas are the following. Categories are arbitrary: the typical viewpoint is that we as humans define the categories, primarily based on our language or culture, the context we are in, so there are no underlying restrictions that we naturally have to follow. Categories are defined by attributes, the features we have talked about during the course, and it is the feature-value combinations that distinguish one category from another. All members of a category share these attributes and no non-members share them, so there is no overlap between the members of a category and its non-members; the intension, which is the set of attributes in a descriptive form, determines the extension exactly, meaning that it defines exactly what the members of the category can be. The member space has no internal structure, the category just has an abstract definition, all members are regarded as equal, first-class citizens of the category, and the levels in a hierarchy, or a lattice if it is not a hierarchy, all have the same status, so one level is not different from another in such a structure. As you see, this is a very well-defined, crisp way of defining categories.
The alternative, which we call the modern or natural view of categories, sometimes called prototype theory, and you will understand why in a minute, says that categories are in many cases motivated by properties of our sensory system and of the world surrounding us. This approach recognizes that there may still be cases where categories can be defined in an arbitrary way, as in the classical view, but in many cases there are physical restrictions of our world and of our sensory systems that constrain what is reasonable to do. The view here is that one builds categories around so-called central members, or prototypes, such that members share more attributes with each other than with non-members. Membership is therefore graded, based on this kind of typicality; typicality generates a topology of the member space, and the borders between categories are therefore fuzzy rather than crisp as in the classical case. Also, in this view the levels of a hierarchy have different status: some researchers argue that the middle layers of the hierarchies, which correspond more to everyday things, form the basic level, while the more abstract categories and the more detailed levels, called the superordinate and subordinate levels, have other properties. The shift from the classical view to the modern view was triggered by a number of works, and a lot of these works were based on a very popular topic: color terms, that is, what kinds of colors we choose to name and perceive in different cultures. There was a famous study by a researcher named Berlin, who looked at color terms in ninety-eight languages, and indeed every culture has a different view of this. If you look back at the course, I hope you can see the coupling: some of the learning methods we looked at are still very much influenced by the classical view, while some others, like the instance-based learning approach and some forms of cluster analysis such as the k-means approach, are very much inspired by the modern view. Hopefully this makes the connection natural: these are the two major ways of viewing categories, and they are naturally reflected also in machine learning.
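
As a hedged sketch of the prototype idea in code (my own illustration, with synthetic data): k-means represents each category by a central member, its centroid, and membership can be graded by distance to these prototypes, exactly the fuzzy, typicality-based picture described above:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Two illustrative clusters of points around different centers.
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    prototypes = kmeans.cluster_centers_  # the "central members" of each category

    # The typicality of a point can be read off as its distance to each prototype:
    point = np.array([[1.0, 1.0]])
    print(prototypes)
    print(kmeans.transform(point))  # distances to each prototype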

Let's turn now to neuroscience, an area which has obviously influenced machine learning a lot. The object of study is the nervous systems of animals in general and humans in particular, with a special focus on the brain, and, as we have talked a lot about, the atoms of the nervous system are neurons. An important observation here is that in the nervous system everything is analog, not digital as in the artificial systems. I will not repeat the terminology about the parts and aspects of the neuron; what I focus on now are two major questions discussed in neuroscience and psychology for almost a century. One issue is the belief that the brain functions contributing to specific behaviors reside primarily in local, limited brain areas, which is called the principle of Locality, contra the belief that large portions of the brain contribute to all kinds of behaviors, which is often called Holism. The second issue is the belief that cognitive models of behavior are just a figment of our imagination and that all that exists is the myriad of atomic neuron activities, which is termed Reductionism; different people have had different standpoints on whether everything happens at the neuron level and whether it is meaningful to try to model anything on a higher level. These two big questions will now be further elaborated as we look in more detail at the work of a number of famous neuroscience researchers: we will talk about Karl Lashley, about Donald Hebb, and about David Hubel, Torsten Wiesel and a few other researchers.

I want to start by saying a few words about the work of Karl Lashley. Karl Lashley was one of the most influential neuroscientists of the first part of the 20th century. He started as a student of the father of behaviorism but developed into the clearest scientific proponent of a balanced view between holism and localization. Furthermore Lashley, in spite of being a scrupulous experimentalist, seriously questioned the more extreme beliefs in reductionism. He moved the focus away from the view of a mostly passive brain primarily triggered by external stimuli; essentially, the earlier idea was that the brain kind of sleeps and only does something when triggered, while Lashley promoted a view of an almost always active brain: whether there are internal or external stimuli or not, the brain is always active, and it has a central, hierarchic control that proactively accommodates external input. Essentially, the view here is that the brain is active and has a central control organization that can handle the whole and react to the various inputs that occur. A few very important concepts were introduced by Lashley. One of them is Equipotentiality, by which he meant that large areas of the brain potentially have the possibility to contribute to specific behaviors. In many of the experiments done by Lashley and others at that time, people looked at brain injuries and even artificially caused lesions of the brain, studying the behavioral effect when a certain point in the brain was deliberately damaged or had been damaged by natural causes. When a part of the brain was damaged, then according to the hypothesis of equipotentiality other areas could potentially take over from the part that from the beginning had been the major contributor to the behavior. He also introduced the idea of Mass action, which has a related meaning: the consequence of brain damage is proportional rather to the amount of damage than to exactly where the damage took place. Finally, he introduced the idea of Plasticity, meaning that if one part of the brain is damaged, other parts can gradually take over its contribution, making the individual able to evoke the same behavior as before the damage occurred. Lashley also seriously researched the possibilities of sharply localizing the manifestation of singular concepts or memories in the brain. This relates to the reflection on whether it is meaningful to talk about where a specific memory resides, or exactly which phenomenon in the brain corresponds to a specific symbol or concept; he called this endeavor the search for the Engram. I can say that his conclusion was empirically negative: his conclusion was essentially that it was very fruitless to find this exact evidence, as fruitless as the search for the Holy Grail, the reason being that the functionality typically seems to be spread over many areas. However, the way he expressed this in his writing was merely as temporary negative observations from the research done so far; he did not take any very radical standpoint on what theoretically could be possible, which means that other researchers who really wanted to find the engram, researchers like Simon and Newell, were rather encouraged than the opposite by reading Lashley.

So let's turn to the work of Donald Hebb; I hope you remember him. In 1949 he published his theory claiming that an increase in synaptic efficacy arises from a presynaptic cell's repeated and persistent stimulation of a postsynaptic cell. Hebb's theory, Hebb's rule, Hebb's postulate, whatever you want to call it, is summarized as: cells that fire together, wire together; the more they fire together, the more they get connected. Essentially there are two ways this can go: if the two neurons on either side of a synapse activate synchronously, then the weight of that connection should be increased, but if they are not activated synchronously, then the weight of that connection should be decreased. As Hebb described the overall learning phenomenon of the brain, it was actually a combination. One can think that Hebb's theory is only local, and of course Hebb's rule describes what happens at each synapse, so the local learning is directly related to Hebb's law; but in his writing Hebb also emphasized more holistic learning, in the sense that what happens in one neuron can happen in all neurons in parallel, and if one considers what happens in the whole system, in his view serial structures or more complex structures are built up. So as a whole, when reading Hebb, the picture you get is both a focus on what happens locally and on how the brain develops more holistically.
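
To make the local rule concrete, here is a minimal sketch of a Hebbian weight update; note that the learning rate eta and the +1/-1 coding of activity are illustrative assumptions of mine, not Hebb's original formulation.

    def hebbian_update(w, pre, post, eta=0.1):
        """One Hebbian step: strengthen the connection when the
        presynaptic and postsynaptic activities agree (+1/+1 or -1/-1),
        weaken it when they disagree."""
        return w + eta * pre * post

    # Synchronous firing increases the weight, asynchronous decreases it.
    w = 0.5
    w = hebbian_update(w, pre=+1, post=+1)   # -> 0.6
    w = hebbian_update(w, pre=+1, post=-1)   # -> 0.5
    print(w)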

The next pair of researchers that we have mentioned during the course are David Hubel and Torsten Wiesel. During the 1950s and 60s they did experiments showing that specific neurons in the visual cortex of cats and monkeys individually responded only to what happened in special regions of the visual field. These smaller regions of the whole visual field they called receptive fields, and what happened in a receptive field primarily triggered the specific neurons locally; of course the receptive fields overlap, so they are not fully separate. This whole process of responses of specific neurons to subsets of stimuli within these fields they referred to as neural tuning. They also hypothesized that there are two kinds of neurons: those that handle simpler phenomena in the visual field, geometrical things like edges and corners, which they called simple cells, and other neurons that can react to more complex phenomena occurring in the visual field. They also hypothesized a model where, for whole pattern or image recognition tasks, these kinds of cells need to work together in a cascaded, more or less hierarchical fashion. Their work is one of those examples where the focus again moved towards locality.

Here I have just included one of the slides that we showed earlier when we talked about image recognition, depicting the current model of, more or less, the human visual system; I think the main message here is that people believe at this point that we have a pretty clear model of this subsystem.

Finally I want to mention more briefly some other important work in neuroscience that also to some extent influenced the way we design artificial systems of the same kind, and researchers that had different standpoints or contributions regarding the two questions I raised initially. Roger Sperry is famous for his work on the two brain hemispheres; he was the first to very clearly observe that the two hemispheres have different functionality. The left and right are different: the left has a stronger role for language and conceptually oriented tasks, while the right is more focused on, for example, spatial functions. Sperry also studied the fact that this is a dominance phenomenon, because it is not entirely true that one side of the brain takes care of everything for a specific function; even here there can be some plasticity, so even if the left is better at language, the right also contributes to language to some extent, but still there is domination of one side over the other. Some Russian researchers, Alexander Luria and others, also looked into the hierarchical organization of different brain regions: which regions were dominating for a certain kind of thought and which were more, one could say, subcontractors. They also tried to study how this role play between the areas changes over time; for example, for a child the more sensory regions may typically dominate over others, while when you get older the more planning-oriented regions get the upper hand. Essentially, one can say that the governance of our thinking changes over time in this way. Other researchers looked, for example, at the relation between simple neural functions and the functionality of the whole organism, and I would say the trick here was that the researchers who did a lot of work on this were clever and chose to study very simple organisms: if you study humans and more advanced animals everything is very complex, so if you really want to see the relations in a system of a few neurons, it makes sense to study very simple organisms, which they did. Others looked even more at localization; for example, there are some famous studies of bird song, where the ability to sing in a particular way seems to be very localized in the bird's nervous system. Finally, a kind of famous standpoint was put forward by a researcher called Pribram: an analogy between the brain and a hologram. One can say this is the extreme holistic view; in the same sense as a hologram is an image recording where every little piece of the recording can be used to reconstruct the whole image, every part of the brain would contribute to everything. But I assume you already understood that this is probably not true; there is some balance here between localization and Holism, as the sum of the contributions I have taken up suggests.

Finally, something related to neuroscience but actually also to logic is the work that has been mentioned as the starting point of artificial neural networks: the work by Warren McCulloch and Walter Pitts. When that work was done it was not considered computer science, nor was it considered artificial intelligence; it was a neuroscience researcher working together with a logician, and what they tried to prove was that the architecture they believed the neurons of the brain follow could function as an architecture for something that could realize or implement logical operators. This still fits as a source of inspiration, because when this was published in 1943 neither machine learning nor artificial intelligence existed, and even computer science was an embryo at that time.
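
As an illustration of their idea, here is a minimal sketch of a binary threshold unit realizing logical operators; the particular weights and thresholds are just one possible choice of mine, not the authors' original notation.

    def mcculloch_pitts(inputs, weights, threshold):
        """A McCulloch-Pitts style unit: fires (returns 1) if and only if
        the weighted sum of its binary inputs reaches the threshold."""
        total = sum(w * x for w, x in zip(weights, inputs))
        return 1 if total >= threshold else 0

    # Logical AND and OR realized by choosing weights and thresholds.
    AND = lambda a, b: mcculloch_pitts([a, b], [1, 1], threshold=2)
    OR  = lambda a, b: mcculloch_pitts([a, b], [1, 1], threshold=1)

    assert AND(1, 1) == 1 and AND(1, 0) == 0
    assert OR(0, 1) == 1 and OR(0, 0) == 0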

A few things have also been mentioned that we should not forget: not all inspiration comes from neuroscience; we also saw that there are systems built inspired by genetics and evolution theory. Evolutionary computing in general, and Genetic Algorithms in particular, are inspired by Darwinian ideas about evolution and the survival of the fittest. In these models one uses the corresponding terms: you look at populations, so you can read a data set as a population; you have chromosomes, which are more like data items, where every position in the chromosome corresponds to a gene, roughly the same thing as a feature. Continuing from that, these kinds of systems try to mimic the way one can see evolution working: you have generations of populations, and in every cycle they are evaluated with respect to how fit they are by using some fitness function; the fittest individuals are allowed to reproduce, either by something called crossover or by a kind of mutation, and there is a certain order to these phases. So even genetics and evolution theory has to some extent inspired work in machine learning.
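
As a hedged illustration, here is a minimal sketch of one such generational cycle; the bit-string encoding, the toy fitness function (counting ones) and the rates are illustrative assumptions, not a definitive implementation.

    import random

    def fitness(chrom):
        # Toy fitness: the number of ones in the chromosome.
        return sum(chrom)

    def crossover(a, b):
        # Single-point crossover: splice two parents at a random position.
        point = random.randint(1, len(a) - 1)
        return a[:point] + b[point:]

    def mutate(chrom, rate=0.01):
        # Flip each gene with a small probability.
        return [1 - g if random.random() < rate else g for g in chrom]

    def next_generation(population):
        # Selection: keep the fitter half, then refill by reproduction.
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: len(ranked) // 2]
        children = [mutate(crossover(*random.sample(parents, 2)))
                    for _ in range(len(population) - len(parents))]
        return parents + children

    population = [[random.randint(0, 1) for _ in range(20)] for _ in range(30)]
    for _ in range(50):
        population = next_generation(population)
    print(max(fitness(c) for c in population))  # best fitness found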

I want to end this lecture by mentioning some inspiration not only from computer science but also from physics and engineering science, and I hope you will have some recollections as I go through them. For example, there is some inspiration from thermodynamics: if you remember the information gain measure that we talked about in the context of building up decision trees, one used an entropy measure as the basis for the information gain, actually an entropy measure over the probabilities of class membership. This is of course strongly influenced by the concept of entropy in thermodynamics, where entropy is a measure of molecular disorder; the second law of thermodynamics states that entropy can never decrease if no order is enforced by external influence. Some ideas from statistical mechanics have also been included: the energy function in Hopfield networks is inspired by a model from statistical mechanics called the Ising model. That model consists of discrete variables which, in the statistical mechanics case, represent the magnetic moments of atomic spins that can have two states, but the analogy is clear. Also, as you may remember, in Boltzmann machines we have processes clearly inspired by annealing processes in metallurgy, where you heat up a material to a very high temperature and then slowly cool it, so that one can more easily reach an optimum state. And finally, very clearly, in reinforcement learning there is very strong inspiration from control theory, operations analysis and cybernetics; one could probably say that this is more than inspiration, in the same sense that mathematics and statistics are important for the majority of areas of machine learning, the theory there is really fundamental to reinforcement learning.
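
To make the metallurgy analogy concrete, here is a minimal simulated-annealing sketch over a generic energy function; the geometric cooling schedule, the neighbour function and the toy target function are illustrative assumptions of mine.

    import math
    import random

    def simulated_annealing(energy, neighbour, state, t=10.0, cooling=0.95, steps=1000):
        """Minimize `energy` by accepting worse states with probability
        exp(-dE / T), then slowly lowering the temperature T."""
        for _ in range(steps):
            candidate = neighbour(state)
            dE = energy(candidate) - energy(state)
            if dE < 0 or random.random() < math.exp(-dE / t):
                state = candidate
            t *= cooling  # cooling schedule: slowly reduce the temperature
        return state

    # Toy usage: find the minimum of a bumpy one-dimensional function.
    energy = lambda x: x * x + 3 * math.sin(5 * x)
    neighbour = lambda x: x + random.uniform(-0.5, 0.5)
    print(simulated_annealing(energy, neighbour, state=5.0))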

So by that I want to end this lecture; thanks for your attention. The next week of the course will be the final week, and essentially we will focus on a repetition of assignment-related tasks as a rehearsal for the final exam; I think that will be the most important thing the last week, but we will also try to show you some larger examples of applications and some demos. Thank you.


Welcome to the eighth week of the machine learning course, the final week. It will be dedicated primarily to preparations for the exam, but also to giving examples of applications to further illustrate what can be done with techniques from this area. This is the first lecture, and we will look back at some assignment-related tasks; my purpose is to try to give you a picture of what your exam will look like.

The general outline of the upcoming exam is as follows. There will be in the order of 50 tasks, and these tasks can give you a hundred marks in total; on the actual exam all tasks will be given in multiple-choice form, so this is how it will look. A disclaimer for today's lecture: I am going to exemplify tasks here, but not in multiple-choice form. I do this because it is more compact than the multiple-choice format, which is pretty awkward when I just want to exemplify which kinds of tasks you can expect. As in the assignments, there will be a balance between recall-type tasks, which will only be worth one mark, and problem-solving tasks, which are worth more, typically two to three marks; there will also be a few more challenging tasks worth four and five marks. What I will focus on very much today is how these tasks will be spread over the course. The idea is that the exam should cover most parts of the course, so I have actually tried to model the course into 21 thematic categories covering the content. It is not necessarily so that all these categories will be represented on an exam; there may be more questions or tasks on one of them, or there may be no tasks at all on another, and that you will see when you get the exam. The tasks will be well aligned, that is the ambition, with the type you have seen in the assignments; they can be slightly different in style, but the aim is not to move too far from the tasks you have seen so far. I also want to assure you that the tasks will be tightly coupled to the lectures. The recommended readings you should of course consider useful, but they are more there to enhance your understanding of the matters discussed in the lectures, so you do not have to fear tasks on topics only mentioned in the recommended material, even though that material can illuminate or explain better or deeper something mentioned in the lectures.

So that is the general outline; let us turn now to my proposed thematic categories. As you understand, this kind of pretty complex course can be structured in many ways; there is actually no right or wrong way of structuring a course, it is up to your taste more or less. On this slide you can see the 21 categories. At the top left you see very general topics like elements from basic mathematics and statistics, elements from logic and computer science, and elements from theorem proving and artificial intelligence; these are not the primary topics of the course, but they more or less always somehow come into play in the various corners of the rest of the course, so therefore they are here, and therefore there will be a few elements on the exam that relate to them, but it is not a major part. After that the course continues into the realm of learning scenarios and content about how categorization works, then a few different approaches to inductive learning. Maybe as slightly side topics, there is some material on Bayesian networks and genetic algorithms; they are part of the picture but, as you may have recognized, not super central to the course. Then we come to instance-based learning, which is an important topic with many connections, followed by cluster analysis as an example of unsupervised learning. From there the course moves into learning scenarios where, in one form or another, there is available domain knowledge that can be used as a strong basis for the learning process, and that part more or less ends with reinforcement learning, which is a natural starting point for moving into the realm of neural networks. As you can see here, and as I guess you already realized from the material of week six, there is a strong focus on neural networks, the main reason being that it is one of the sub-areas of machine learning on which there is a lot of limelight today. The next slide just illustrates how the flow of the course has moved; it is a linearization, and my ambition is that the course has developed more or less following the path just orally described.

What I will do now is simply take a pretty fast walk through these subtopics, or I would say content categories. There is absolutely no ambition here to present anything new; if there happens to be a little detail that is new, it is more of an error than something that was aimed for. I have merely tried to collect on each of these slides some keywords that should give you the right places to look when you want to recapitulate what was articulated and focused on in the earlier lectures of the course. The first slide here is on mathematics and statistics. As I hope you understood, vectors, matrices and tensors, in general a lot of mathematical subtopics related to these kinds of arrays, and operations on them, are important and occur over and over again within the course; also some geometrical concepts relevant for visualizing state spaces, and some optimization techniques occurring particularly in relation to reinforcement learning. Graphs occur pretty often, so the very basics of graph theory and the properties of graphs are important, and to some extent also simple aspects of calculus such as integration and differentiation. These are the key mathematical topics needed. Some statistics is also touched upon, primarily inferential statistics, where we draw conclusions about a population from a sample; the core mathematical basis is probability theory, in particular stochastic processes, Bayesian methods and Monte Carlo methods. So this is the picture for math; very simple recall questions, and also very simple questions of problem-solving character, can occur for these topics.

Let's move on to logic and computer science. As you understand, we only scratch the surface here, but elements from logic and computer science have occurred in the course. Central for learning is the relation to the classical inference styles like Deduction, Induction, Abduction and so on, so that is the key part. With respect to computer science, which is of course an extremely wide area, we mostly related to some basic knowledge about algorithms and data structures, and to some extent databases; in week 7 we also talked about supporting tools, so we have also touched upon programming languages, distributed computing, APIs and programming libraries, and technical computing systems. Then, going on to artificial intelligence and theorem proving: there are a lot of couplings between theorem proving and artificial intelligence, and we have talked about that in various corners of the course. We also talked about knowledge representation schemes, which are necessary when we talk about learning, and one particular scheme that was highlighted was logic programming; such techniques occur in various corners of the course.

Now we are moving more into machine learning proper. As you remember, very early in the course there were some general lectures about the different types of learning scenarios: whether we look at supervised learning with pre-sorted examples, or unsupervised learning, or the very specific situation of reinforcement learning, where the behavior of the system is evaluated and graded by some external environment. We also discussed distinctions like batch learning versus online learning, batch learning being roughly that you collect all the examples, all the instances, and base the learning process on them, so that in principle you have all the instances at the same time, while online learning works more dynamically in a situation where new instances come up over time. Also important are the relation between classification and regression, which occurred systematically during the course; the relation between symbolic and sub-symbolic learning; the difference between situations where we want to adapt a system and more of a data-analysis scenario, where we do not really have a system in a narrow sense whose behavior is changing; and then a number of separate, more specific concepts that are always important when we discuss the various algorithms: overfitting, underfitting, linear separability and so on.

In the same part of the course we talked about categorization, categorization also being close to semantic networks, for the reason that in artificial intelligence categories have many times been represented in terms of semantic networks. What we did here was to look at basic principles for object-oriented descriptions, the description of objects in terms of features, the building up of structures among categories, some interdisciplinary basis for how we categorize the world, and finally some more detailed information about the format of the so-called semantic networks, with relations modeled in terms of edges in the network.

Okay, leaving that, we come, I would say, more to the core of the machine learning course: the kind of classical core algorithms for doing induction, more or less without a pre-existing domain theory. Here we talked about the different kinds of languages needed to represent the instances and the hypotheses; how to define relations in the hypothesis space, typically in the form of a mathematical partial ordering; and the distinction between different approaches, either data-driven, starting from the examples, or generate-and-test, which essentially means the opposite: we start top-down, building up and testing hypotheses and seeing how well they fit. So data-driven, bottom-up, versus generate-and-test strategies. The focus was then more on comparisons of the various data-driven search strategies, like depth-first, breadth-first and version space.

Let's move on. After that we looked at decision trees and how to build them up, both for classification and regression; the method is essentially top-down. We talked about various properties of the relation between the instances and the built-up trees, such as purity/homogeneity. We spent some time on various information-theoretic measures that can help guide us in how we build up the trees, and did some simple calculations on those. We also looked at the particular algorithm which I would say is the mother algorithm in this realm, called ID3, with some simple examples of using it. Then we also touched upon, but not in much detail, how to handle various situations that can occur, like underfitting and overfitting, and basically how to handle overfitting by pruning the tree. Finally we looked at alternative approaches, where one builds up several trees in parallel and then uses the best tree or the average result from using all the trees together; one example is random forest, etc.
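
As a reminder of the kind of calculation involved, here is a minimal sketch of entropy and information gain over class labels; the two-class toy data is an illustrative assumption.

    import math
    from collections import Counter

    def entropy(labels):
        """Shannon entropy of a list of class labels (in bits)."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(labels, splits):
        """Entropy reduction achieved by partitioning `labels` into `splits`."""
        n = len(labels)
        return entropy(labels) - sum(len(s) / n * entropy(s) for s in splits)

    # Toy example: a split that separates the classes perfectly has gain
    # equal to the parent entropy (1 bit for a balanced two-class set).
    parent = ["yes", "yes", "no", "no"]
    print(information_gain(parent, [["yes", "yes"], ["no", "no"]]))  # -> 1.0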

After decision trees we went into Bayesian networks. I would say that what we did there was not so much to look at the learning of Bayesian networks, even if that is an area in itself, but rather at the basic properties of these networks. At the core we have Bayes' theorem, which helps us to infer causes from evidence, and we had an assignment with simple calculations based on the use of Bayes' theorem. We also looked at very small examples where we built a small Bayesian network for a particular case, and the key to those exercises is the building up of the conditional probability tables coupled to the network nodes. So this is the focus here.
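
As a small worked reminder, here is Bayes' theorem applied to a hypothetical diagnostic example; the numbers (a 1% prior, 90% likelihood, 5% false-positive rate) are illustrative assumptions, not from the lectures.

    def bayes_posterior(prior, likelihood, false_positive):
        """P(cause | evidence) via Bayes' theorem:
        P(C|E) = P(E|C) P(C) / (P(E|C) P(C) + P(E|~C) P(~C))."""
        evidence = likelihood * prior + false_positive * (1 - prior)
        return likelihood * prior / evidence

    # Hypothetical numbers: a rare cause and a fairly reliable test.
    print(bayes_posterior(prior=0.01, likelihood=0.9, false_positive=0.05))
    # -> roughly 0.154: even a positive test leaves the cause fairly unlikely.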

Let's move to the next topic, genetic algorithms. It is an important area, but many things have been covered in this course and not everything can be treated equally thoroughly, so in my judgment we had a pretty shallow description of genetic algorithms, focusing on the basic ideas, the way one needs to represent things with the terminology inspired by genetics, and a typical algorithm scheme including the various phases, such as selection of the fittest, evaluation of the generation, and creation of new generations, and so on. We also had some very small exercises coupled to that, related to the key operations, in particular the crossover operation, which is the main operation for reproduction. Finally we also shortly described how one can implement rule-based systems in terms of genetics, and thereby get a learning property in the rule-based system: you have a rule-based system, you apply it, you look at how it works, you accumulate evidence, and you use that evidence for evaluating the fitness of the rules, in the sense of how much each rule contributed to a good outcome of the problem solving. Then you essentially create a new generation of the rule base, where some of the rules are taken away and some of the rules are reproduced with changes, and you get a new generation that you can apply again. Interestingly, and I also highlighted this as a parallel, when we come to neural networks later it is important to have a mechanism to back-propagate the results or rewards given from the outside world and essentially grade the performance of the various components; in the neural network case it is the performance of the various weights in the system, but in this case, with a classifier system implemented as a genetic algorithm, it essentially becomes a grading of the specific rules. One needs a similar back-propagating algorithm as in the neural network case, and here it is called the Bucket Brigade algorithm. So nothing is new under the sun.

After that we turned to instance-based learning, and there is a lot of content in that section. We discussed instance-based learning in general and the structure of the instance space; we focused on the K nearest neighbor algorithm; we looked at distance and similarity measures; we also looked at the so-called weighted nearest neighbor algorithm; we discussed binary linear classifiers; we went more in depth into support vector machines; and finally we discussed the situation where smart mappings from a low-dimensional space, typically two dimensions, to a higher-dimensional space, maybe three, can enable the use of support vector machines also for nonlinear cases.
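
As a reminder, here is a minimal K-nearest-neighbor classifier sketch; the Euclidean distance, majority vote and toy two-dimensional points are illustrative assumptions.

    import math
    from collections import Counter

    def knn_classify(query, examples, k=3):
        """Classify `query` by majority vote among the k nearest
        labeled examples, using Euclidean distance."""
        def dist(p, q):
            return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
        nearest = sorted(examples, key=lambda ex: dist(ex[0], query))[:k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]

    # Toy data set with two classes.
    examples = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
                ((5, 5), "B"), ((6, 5), "B"), ((5, 6), "B")]
    print(knn_classify((2, 2), examples))  # -> "A"
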
As a natural extension of that we went into cluster analysis. A lot of the focus there was on looking at the large variety of different kinds of clustering techniques: mostly partitioning-based clustering, where K-means clustering is the main approach, but we also looked at hierarchical clustering and density-based clustering. There was some focus on the distance and similarity measures, which are more or less the same topic as for instance-based learning. We did not do many calculations in this realm, but there were some examples in the assignments on how to calculate proximity matrices, and a proximity matrix should be distinguished from a similarity matrix: the similarity matrix concerns the distances between the instances, while the proximity matrix concerns the distances, in quotation marks, between the clusters.
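
To recall the main partitioning approach, here is a minimal K-means sketch; the random initialization, fixed iteration count and toy points are illustrative assumptions.

    import random

    def kmeans(points, k, iters=20):
        """Plain K-means: alternate assigning points to the nearest
        centroid and recomputing centroids as cluster means."""
        centroids = random.sample(points, k)
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for p in points:
                i = min(range(k), key=lambda c: sum((a - b) ** 2
                        for a, b in zip(p, centroids[c])))
                clusters[i].append(p)
            # Keep the old centroid if a cluster happens to be empty.
            centroids = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl
                         else centroids[i] for i, cl in enumerate(clusters)]
        return centroids, clusters

    points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
    centroids, clusters = kmeans(points, k=2)
    print(sorted(centroids))  # two centroids, one near each group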

After that we went into the realm of learning in the context of prior knowledge, and there were two parts initially that are closely related to the existence of theories in symbolic form. One is inductive logic programming, and what we did there was to discuss some logic programming fundamentals, but also, basically, the two main ways of doing learning in inductive logic programming: one based on specialization, the other based on generalization, and we showed some examples of these two variants.

After that we went into explanation-based learning, which is essentially the situation where you have an almost complete domain knowledge or domain theory that you want to complete or finalize; in many cases explanation-based learning does not mean that totally new knowledge is created, but rather that the available knowledge is compiled into more efficient forms.

The next big block is reinforcement learning. We talked about the terminology; we talked about the concept of a value function; we talked about the most common way of modeling reinforcement learning situations, which is the Markov decision process model; and then we talked about various ways of attacking the solution to such problems: dynamic programming, Monte Carlo simulations, and finally temporal difference methods, with Q-learning as our main example. Of course we also discussed some more principled distinctions between various approaches: passive versus active, on-policy versus off-policy, and so on.
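
As a reminder of the core of Q-learning, here is a minimal sketch of the tabular update rule; the learning rate, discount factor and dictionary-based table are illustrative assumptions.

    from collections import defaultdict

    Q = defaultdict(float)  # table of Q(state, action) values

    def q_update(state, action, reward, next_state, actions,
                 alpha=0.1, gamma=0.9):
        """One tabular Q-learning step:
        Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
        best_next = max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next
                                       - Q[(state, action)])

    # Toy usage: a single observed transition in a two-action world.
    q_update(state="s0", action="right", reward=1.0,
             next_state="s1", actions=["left", "right"])
    print(Q[("s0", "right")])  # -> 0.1
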
So this became, I would say, a pretty substantial part, even if it is just one sub-theme; the real comment here is that there is not always total balance between these parts, some are much more content dense than others, if we look at what I here call subtopics.

Before we went into neural networks we had a pretty short lecture on Case-Based Reasoning; obviously this topic, which is an area in itself, has not been covered so thoroughly in the course either. But let us move now to what is probably the heaviest set of subtopics, the one related to neural networks. First there were lectures focused on the single atomic parts, the neurons, and of course on the precursor of the current neuron model, called the perceptron. Important at the core here is the weight-updating scheme, as it is for most of artificial neural network practice, and there was also some talk about the various kinds of activation functions that can be used and their properties.
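
As a reminder, here is a minimal perceptron learning sketch with a step activation; the learning rate, epoch count and the toy OR data (which is linearly separable) are illustrative assumptions.

    def predict(w, b, x):
        # Step activation on the weighted sum.
        return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

    def train_perceptron(data, n_inputs, eta=0.1, epochs=20):
        """Classic perceptron rule: nudge weights by eta * error * input."""
        w, b = [0.0] * n_inputs, 0.0
        for _ in range(epochs):
            for x, target in data:
                error = target - predict(w, b, x)
                w = [wi + eta * error * xi for wi, xi in zip(w, x)]
                b += eta * error
        return w, b

    # Toy task: learn logical OR.
    data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
    w, b = train_perceptron(data, n_inputs=2)
    print([predict(w, b, x) for x, _ in data])  # -> [0, 1, 1, 1]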

After that comes what I would say is the classical core methodology part of artificial neural networks: the multi-layer networks in which the neurons are the parts, and the methodology for feeding signals forward through the network, estimating the error at the output level, and then back-propagating the outcome of that error estimation as the basis for updating the weights.
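
To make that pipeline concrete, here is a minimal sketch of a one-hidden-layer network trained by backpropagation on the XOR problem; the network size, learning rate, iteration count and random seed are illustrative assumptions, not settings from the lectures.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Toy data: XOR, the classic non-linearly-separable example.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # hidden layer (4 units)
    W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # output layer
    eta = 0.5

    for _ in range(10000):
        # Forward pass: feed the signals through the network.
        h = sigmoid(X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)
        # Backward pass: estimate the output error and propagate it back.
        d_out = (out - y) * out * (1 - out)      # gradient of squared error
        d_h = (d_out @ W2.T) * h * (1 - h)
        # Weight updates from the back-propagated error terms.
        W2 -= eta * h.T @ d_out
        b2 -= eta * d_out.sum(axis=0)
        W1 -= eta * X.T @ d_h
        b1 -= eta * d_h.sum(axis=0)

    print(out.round(2).ravel())  # typically close to [0, 1, 1, 0]
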
A natural follow-up is then an extension of the basic ANN architectures: the recurrent neural networks, which are essentially a way, in most cases, of artificially creating the possibility to handle state, by introducing special units that can be unfolded to handle the multiple states of a specific item in the model. We discussed the standard version of that, what we call the vanilla recurrent neural network, and of course illustrated how it can be extended by putting these elements in multiple layers, and also how there can be communication back and forth within the same layer; what we really looked at in more detail was how this works in the vanilla version and the unfolding of it. We also talked very briefly about long short-term memory (LSTM), a very successful variant or successor in the sense that it is being used with great success in certain application areas.

Then, for a little while, we moved from the mainstream ANN into the realm of associative memory, of course still modeling associative memory with an artificial neural network; the focus there was mostly on the approach by Donald Hebb, as it is inspired by cognitive models of the brain.

We moved on to further realizations of associative memory, in two versions: one called Hopfield networks and the other called the Boltzmann machine. We discussed Hopfield networks and the energy concept that is used to guide the search for optimal solutions, and, in the Boltzmann machine in particular, the process called annealing, inspired by metallurgy: you heat up the system, as an analogy of course, and then slowly cool it down, with the hope that this heating and cooling in some cases gets us out of undesirable local minima and brings us to the global minimum.
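
As a reminder of the energy concept, here is a minimal sketch of a Hopfield energy function and one asynchronous update step; the tiny symmetric weight matrix, the absence of thresholds and the zero diagonal are illustrative assumptions.

    import numpy as np

    def energy(W, s):
        """Hopfield energy E = -1/2 * s^T W s for a state s in {-1, +1}^n
        (no external thresholds, zero diagonal assumed)."""
        return -0.5 * s @ W @ s

    def update_unit(W, s, i):
        # Asynchronous update: a unit aligns with its local field,
        # which never increases the energy.
        s = s.copy()
        s[i] = 1 if W[i] @ s >= 0 else -1
        return s

    # Tiny symmetric weight matrix coupling two units positively.
    W = np.array([[0.0, 1.0], [1.0, 0.0]])
    s = np.array([1.0, -1.0])
    print(energy(W, s))      # -> 1.0 (the units disagree)
    s = update_unit(W, s, 1)
    print(energy(W, s))      # -> -1.0 (the units now agree)
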
Then, at the end, the core content of the course focused on convolutional neural networks, which, as you may remember, primarily target applications in image recognition. We looked a little at the source of inspiration for the basic operation here, convolution, and also at the interdisciplinary inspiration sources for the whole architecture, which is essentially the spatial organization of the convolution system. We looked at the terminology and the various items occurring in a convolutional neural network architecture, and we showed very simple examples of how the central operations, primarily convolution and pooling, could be done.
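
As a reminder of those central operations, here is a minimal sketch of a valid 2D convolution (strictly speaking cross-correlation, as is conventional in CNNs) and non-overlapping max pooling; the tiny input image and the edge-detecting kernel are illustrative assumptions.

    import numpy as np

    def conv2d(image, kernel):
        """Valid 2D cross-correlation: slide the kernel over the image
        and sum the elementwise products at each position."""
        kh, kw = kernel.shape
        h = image.shape[0] - kh + 1
        w = image.shape[1] - kw + 1
        out = np.empty((h, w))
        for i in range(h):
            for j in range(w):
                out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
        return out

    def max_pool(fmap, size=2):
        """Non-overlapping max pooling: keep the maximum of each block."""
        h, w = fmap.shape[0] // size, fmap.shape[1] // size
        return np.array([[fmap[i*size:(i+1)*size, j*size:(j+1)*size].max()
                          for j in range(w)] for i in range(h)])

    # Toy 4x4 image with a dark-to-bright vertical edge, and a kernel
    # that responds strongly to exactly that kind of edge.
    image = np.array([[0, 0, 1, 1],
                      [0, 0, 1, 1],
                      [0, 0, 1, 1],
                      [0, 0, 1, 1]], dtype=float)
    kernel = np.array([[-1, 1],
                       [-1, 1]], dtype=float)
    fmap = conv2d(image, kernel)   # strong response at the edge column
    print(max_pool(fmap))          # -> [[2.]]
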
So that's it; we have reached the end of the sub-themes.

So that was all the sub-themes. What I have done now, and I will not go through this in detail, is here for your service: examples of the kinds of questions that can come up on your exam. I want to stress that these are just inspirational examples, to make it easier for you to focus when you now go back and rehearse based on the other material you have from the earlier lectures. What you find here is a list of questions that hopefully can guide your preparation studies in the right direction; obviously many more questions could be put within each little category. The other disclaimer is that, for the sake of space, I did not want to give these questions a multiple-choice look, but I want to stress that on your exam all tasks will be formulated in multiple-choice form; they will not be open questions like these, although, as you understand, every open question can be re-engineered into a multiple-choice form, which is what happens on your exam. So please have a look and use this source of information. It may be that some of these questions have occurred in the assignments; mostly I do not think they have, but these example questions are not exclusive. So these were the simple recall questions, worth one mark. Then you get a similar list of example questions, with the same disclaimers, for the more problem-solving oriented ones, which give a couple of marks each; the actual number may depend on the judged complexity of the task. And finally I also give you some examples, spread out across the sub-themes, of the somewhat more complex tasks; you will not have many of those. If you want a picture of the balance between the task types, go back to the first slide I gave you: do not believe you will get ten hard questions, that will not happen; you will get a handful or fewer hard questions, and the majority of the questions will be one-mark questions or two-to-three-mark problem-solving questions.

Okay, so good luck with that. I hope this lecture has been useful for you and that it can guide your preparation for the exam; thanks for your attention. We will follow up with one more lecture; however, I want to state very clearly that that lecture, with application and demo examples, is there only for your own interest and has absolutely no bearing on the exam. So I advise you now to focus on the preparation for the exam, and hopefully what I have given you today is useful. The very final lecture is just there to give you some illustrations of the usefulness of the area and, as I said, has absolutely no bearing on your exam. Thank you so much.

THIS BOOK
IS NOT FOR
SALE
NOR COMMERCIAL USE

(044) 2257 5905/08


nptel.ac.in
swayam.gov.in
