Welcome. This is week two, and this is where
we get started with the basic, preliminary
concepts of deep learning, or what we call deep
learning with neural networks. As you know, in
the first week we learned the basics and the
classical ways of visual computing: we took
images and related each image to its features,
which are more compact representations of the
image, and from there we got to what is called
a classification problem, that is, associating
a categorical label with the image.
We did it the whole classical way: take an
image, and on the lab side you also learned how
to code and extract the features we had studied
in the preceding lectures, and then use a
neural network for classification. That was a
very simple perceptron model used to classify,
and together that completed the lab.
Today we enter the first introductory lecture,
which is called Introduction to Deep Learning
with Neural Networks. In it we will start to
understand what the relation to neural networks
is and why the word "deep" comes into it. But
before any of that, the first thing we need
very clear in our minds is what we define as
learning.
Now, if you go by the very classical definition
of learning by Professor Tom Mitchell, whose
classical textbook on machine learning outlines
it, the definition goes like this: a computer
program is said to learn from experience E with
respect to a certain class of tasks T and a
performance measure P. So there are three
attributes to this activity called learning:
an experience E, a particular task T which the
program has to perform, and a performance
measure P.
If you go back through our first week lectures,
you will see that we were doing one particular
task, image classification, and no other task.
If I just want to find out whether or not there
is a ball in the image, then that is the only
task T I am trying to solve. In order to solve
this task I was gaining a certain experience E
by looking at a number of images: the more
images I looked through, the more the
experience, and the more epochs over which I
iterated, the more the experience. On the basis
of T and E, I was using a performance measure
P, which here is my error function.
That was the cost function I was using. As my
experience increased, that is, the number of
epochs over which I iterated and the number of
images I looked at, my error was coming down.
That means I am able to measure my performance
and see that it is improving: as the
performance improves, the model becomes more
and more accurate, and accordingly my error
keeps going down.
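To make T, E and P concrete, here is a minimal sketch of a single perceptron whose error count is recorded after every epoch. The toy two-feature samples, the function name `train_perceptron` and the learning rate are assumptions made for illustration, not the lab code from week one.

```python
def train_perceptron(samples, epochs=30, lr=0.1):
    """Train one perceptron and record the number of mistakes in each epoch."""
    n_features = len(samples[0][0])
    w = [0.0] * n_features
    b = 0.0
    errors_per_epoch = []  # performance measure P, tracked as experience E grows
    for _ in range(epochs):  # every epoch is one more pass over the images
        errors = 0
        for x, label in samples:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            pred = 1 if activation > 0 else 0
            delta = label - pred
            if delta != 0:  # update the weights only on a mistake
                errors += 1
                w = [wi + lr * delta * xi for wi, xi in zip(w, x)]
                b += lr * delta
        errors_per_epoch.append(errors)
    return w, b, errors_per_epoch

# Hypothetical feature vectors; label 1 means "ball present" (task T).
toy_samples = [([0.9, 0.8], 1), ([0.8, 0.9], 1), ([1.0, 0.7], 1),
               ([0.1, 0.2], 0), ([0.2, 0.1], 0), ([0.0, 0.3], 0)]
w, b, errors_per_epoch = train_perceptron(toy_samples)
print(errors_per_epoch)
```

Because the toy data is linearly separable, the mistake count reaches zero after enough epochs, which is exactly the "error comes down as experience goes up" behavior described above.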
So that is what the definition says: the
program is said to learn if its performance on
a task T, which here is classification, as
measured by the performance measure P, improves
with experience E. Now, if you introspect on
this definition, it is in no way very different
from human learning. As human beings, when we
say that we are learning something, the whole
task of learning is about gaining more and more
experience, and that is why we say practice
makes perfect. As you get more experience,
your performance measure becomes higher and
higher with respect to a certain class of
tasks. That is the very classical, standard
definition. Now that we have defined it, let
us try to demystify what it would mean.
Let us start with a very basic problem: I have
an image, and I would like to give a word
equivalent description of this image. Say this
is an image. A lot of you upload your images
on Instagram, on Facebook and on other social
networking sites, and what you are doing by
sharing all of these images is sharing some
sort of an experience, so that people get to
know where you have been, what you have seen,
what you have learned, whatever you
experienced. The more you share, the more
collective knowledge humankind gains.
So this is the image that is given. Now say
there is a blind person who is not able to see
things. That person may be on your friend list
or maybe not, but you would like this blind
person to somehow know and contemplate your
experience.
For that, the blind person has to somehow be
able to understand what is there in the image,
and that necessitates something like an image
to text conversion that brings out whatever is
present in this image. That is, given this
image, can I have a text equivalent of it? This
is what is called the image captioning problem
today.
What happens is something like this. Initially
a computer program will divide the image into
some blocks. Using these blocks, it knows that
if it can identify each block independently,
then it can give you a reasoning about what
the whole image is. So what it does is start
annotating each of these blocks.
Now, if you look at the curve on the side, you
will see that as the program is able to
understand and recognize each block, its
experience grows and its performance increases
along with it. In the next instant it is able
to identify some more objects, which are the
faces you see, and eventually it identifies
the Great Wall logo and a Great Wall tower.
Finally, it can identify the different people,
if it has a corpus of all the faces, and then
write a sentence equivalent of whatever it has
identified. The interesting aspect is that it
is able to come from an image, via all of this
segmentation, recognition and identification,
to a final sentence equivalent description by
gaining a lot of experience.
That necessarily means it is not a one shot
job: the computer took a lot of time to gain
that experience. As it gains experience it
also gains performance, and the best
performance is when you are able to produce a
sentence equivalent. Initially, when you are
just able to give blocks, it will say that the
image has a number of fragments, say some seven
or eight fragments, into which you can divide
it.
Of these fragmented blocks, say four are human
beings, there is a mountain, and there is some
sort of a logo, but this does not match the
target sentence closely, so there is a lot of
error. Finally you get close to the sentence
which says "kim they bluth kim young and wong
standing near the Great Wall logo, and the
Great Wall tower is behind them". That is the
closest match to the sentence, and you get the
best performance, with the error coming down.
Now that we know what we meant when we said it
is learning something, the next objective is to
understand what it was learning and how it was
learning, and, more than what it learns, how it
actually goes on to learn it. Let us go back:
in the whole pipeline you get the image first,
and the first objective is to break it down
into some number of segments.
These are the different segments, and let us
say this step is breaking down an image into
its salient segments. Once you have broken down
the image into salient segments, the next part
is to identify some of these segments, which is
also called an objectification task. Once you
are able to objectify these blocks, the next
part is to detect whether there are humans or
not, and you find that there are a certain
number of humans.
Then the machine is able to find that there are
a certain number of inanimate objects as well;
those are treated separately from the humans.
Once it knows that there are humans, it will
try to recognize them: find out who is who, and
which person each human figure in the image
actually is. Once we are able to do all of
this, we can describe the whole scene from
there.
This is essentially what the machine is able to
do, but the question is even bigger. We know
that it was learning by doing something of this
sort, going up a hierarchy: it starts at the
base of the pyramid, which is the salient
segments, and keeps climbing up to the
description of the scene. As it keeps going up
that hierarchy, this is where the depth of the
whole learning comes in.
The aspect of deep learning says that the
machine is gaining this depth by looking at
more images, getting more and more experience,
and accordingly its performance is increasing.
The point is that you see some sort of a
hierarchical nature on which this whole
recognition task is based. Is this hierarchy
really unique, or is it non-unique? That is the
major question we have as of now.
I will give you a few seconds to ponder whether
it is unique or not. Do you think there can be
other ways of solving this problem, instead of
this being the only possible way? That brings
us to a pertinent challenge: as it turns out,
this is not a unique way of solving the
problem, and there can be multiple ways of
doing it.
Let us make a parallel universe in which we
create a different model. Initially I make a
replica of this one, then I remove some of the
connections and reorder the blocks. Once I have
reordered them I can connect them again, and if
you look carefully, the number of blocks and
the objective of each block are still the same,
whereas the order in which the blocks are
connected has changed. Within the blocks
nothing has changed. You can still put in an
image and it will break it down into salient
segments. In the earlier case we were
objectifying first, but here it detects humans
in the segments, then goes on to objectify what
they are, and then goes on to recognize.
Some of you can even say that we can pull the
recognizer below the objectifier, push the
objectifier above, and then have the inanimate
things, and that is perfectly fine; that is
another possibility. So essentially it turns
out that there is no unique way of solving this
problem, and that is not a very happy place to
be.
For us as researchers this is a very
interesting point, because if there can be
multiple ways of solving a problem, then every
time I find one way of solving it I get a paper
out, and there can be a PhD thesis out of it.
From a product development perspective,
though, it is a really dicey situation, because
if there are non-unique ways of solving a
problem, you will have to explore each and
every possibility and find out which is the
best possible method to achieve a solution.
This is the challenge we have here: there can
be non-unique solutions, and we can always keep
proposing yet another way of solving the
problem. In fact, there are certain interesting
papers that come out in conferences titled
"yet another way of solving this problem".
[laughter]
For industry this is really a major issue,
because as these non-unique ways of solving the
problem creep up, they become bigger
challenges. One issue was that the arrangement
of the blocks was non-unique. But is that the
only issue? As it turns out, it is not the only
challenge you face. Consider one of these
blocks, say detect humans. One way of doing it
is this: I get the salient segments, the
objectify block says that there is some sort of
object, and now we want to see whether it is a
human or not. We take all the segments from the
objectify block and extract some features on
each small block. That should be enough,
because this is a simple image classification
problem: human or not human. I can compute
features like wavelets, LBPs and histograms of
oriented gradients, which we have already
studied. From there, using each of them, we can
do some sort of body part recognition, that is,
run a classifier which divides the patch into a
number of small segments and says which segment
is what, so I can identify the head, legs and
hands.
This is body part recognition. Once I have my
body part recognition, I can draw lines between
these body parts and find the distance
relationships between them. For a standard
human there is a range for these distance
relationships and for the angles between the
parts, and using these measures I can run
another classifier and find out whether this
appearance matches that of a human or not.
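The distance relationships between body parts can be sketched as scale-free ratios. This is only an illustration under stated assumptions: the part names, the pixel coordinates and the acceptance ranges in `HUMAN_RANGES` are hypothetical, and a simple range check stands in for the second classifier mentioned in the lecture. Angles are left out to keep the sketch short.

```python
import math

# Hypothetical acceptance ranges for illustration only, not measured data.
HUMAN_RANGES = {"arm_to_leg": (0.6, 0.9), "leg_to_torso": (1.0, 1.6)}

def limb_ratio_features(parts):
    """Turn body part positions (x, y) into scale-free distance ratios."""
    arm = math.dist(parts["shoulder"], parts["hand"])
    leg = math.dist(parts["hip"], parts["foot"])
    torso = math.dist(parts["head"], parts["hip"])
    return {"arm_to_leg": arm / leg, "leg_to_torso": leg / torso}

def looks_human(parts, ranges=HUMAN_RANGES):
    """Stand-in for the second classifier: every ratio must fall in its range."""
    feats = limb_ratio_features(parts)
    return all(low <= feats[name] <= high for name, (low, high) in ranges.items())

# Hypothetical outputs of the body part recognition step, in pixels.
upright = {"head": (50, 10), "shoulder": (50, 30), "hand": (50, 80),
           "hip": (50, 70), "foot": (50, 140)}
long_armed = dict(upright, hand=(50, 110))  # same figure with a much longer arm
print(looks_human(upright), looks_human(long_armed))
```

Using ratios rather than raw distances means the check does not depend on how large the figure appears in the image.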
There can be other bipedal animals which are
not human: there can be an orangutan, or say
even a kangaroo, which are bipeds as such. But
the ratio of the length of their hands to their
legs is pretty different, and so is the
posture, because they do not always stand
upright. There can be a bear standing upright,
but then the ratios are different, the posture
is different, the distances between the legs
and the hands are different, and the angles at
which these parts are connected are also pretty
different, and that is what makes it easy to
classify. Using this whole thing I can find out
that there is a human.
It turns out this is not the unique way of
doing it. Another way is to take the segment
and run a chroma clustering and a posture
alignment on it. As a human I can be standing
upright, I can lean down, I can have twisted
views, any of these, and chroma clustering is
what helps you find out which is the background
and which is the foreground. Once you have all
of this together, you can do some sort of
silhouette matching: if human beings are
present, this is what their shadows or outlines
would look like in black and white. For a cat,
a dog, a monkey or a goat, the silhouettes will
be very different. I can then run some sort of
distance algorithm on top of it and do a
classification task, which is to detect the
humans.
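The silhouette matching step can be sketched as a nearest template search under a distance between binary masks. The lecture does not name a particular distance, so one minus intersection over union is used here as an assumed choice, and the tiny 3 by 4 templates are made up for illustration.

```python
def to_mask(rows):
    """Convert rows of '#' (foreground) and '.' (background) into 0/1 lists."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]

def silhouette_distance(mask_a, mask_b):
    """1 - intersection over union of two same-sized binary masks."""
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += a and b
            union += a or b
    return 1.0 - inter / union if union else 0.0

# Hypothetical, heavily simplified silhouette templates.
TEMPLATES = {
    "human": to_mask([".#.", "###", ".#.", "#.#"]),
    "four_legged": to_mask(["...", "###", "###", "#.#"]),
}

def classify_silhouette(mask, templates=TEMPLATES):
    """Return the name of the template closest to the given silhouette."""
    return min(templates, key=lambda name: silhouette_distance(mask, templates[name]))

query = to_mask([".#.", "###", ".#.", "#.."])  # foreground mask from the segment
print(classify_silhouette(query))
```

In practice the foreground mask would come from the clustering step, and the templates would cover the different postures mentioned above.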
So here also we have two different, non-unique
ways of detecting humans, and as it turns out
you can have any number of non-unique ways of
detecting humans. This is not an easy thing; it
actually gives rise to a huge dilemma in
practice, and that dilemma does not exist only
in the field of machine vision. It is there in
speech and signal processing as well: what is
the way of recognizing speech? It turns out
there is no unique way of doing it.
It exists in the field of handwritten digit
recognition as well: if you write something
down, how can we identify it? It exists in the
field of natural language processing, which is
about whether you can make inferences from
sentences. Today, when you search on Google,
you no longer put in some keywords as a query;
you ask a full question, as if you were asking
a human being in front of you.
What is the weather today? Will it rain today?
The system has to break that down into computer
readable instructions. If you ask "what is the
weather today", it knows that it has to get
today's date and generate a query to a weather
website, maybe weather.com, send today's date
and your location, and ask the site to give a
response. Then it has to reconstruct a sentence
from that response, and that is what natural
language processing is about.
From there, there are interesting problems in
hierarchical and transfer learning as well, and
we will go a bit later into what transfer
learning and hierarchical learning are all
about. The dilemma also exists in the field of
medical imaging and image analysis. This
non-uniqueness dilemma was an interesting
avenue and scope for researchers for a long
time, but today, with the advent of deep
learning, this avenue is more or less closed.
What we do instead is arrive at the most
consistent solution available by discovering
the solution through a learning method. Whether
we can then analyze this discovered solution,
which is called the explainability of the
learnt model, is more of a research challenge
today.
Given that we have these dilemmas, the
objective is to tackle them, and what is done
is something of this sort. Say you have the
image captioning problem. I can take an image,
organize all the pixels of the image into a
vector, and then keep connecting these through
subsequent nodes. Finally it generates some
sort of an output which says that there is the
Great Wall behind, there is a Great Wall logo
beside, and there are these four people.
If you look at this, it is some sort of a
network like architecture: you give in all the
pixels, and from the pixels it translates to
alternate representations by combining the
pixels together into one representation and
then another. As it goes on it follows a
hierarchy, and finally it comes down to a
classification, or associating certain labels.
This model very closely corresponds to what is
called a multi-layer perceptron. It builds on
what we learned in the earlier lecture on
simple neural networks, and we will get into
exactly how the mathematics of the multi-layer
perceptron is handled later. But before that:
within it, you take in all the input pixels,
and they are transformed through a number of
layers of neurons, each of which is called a
hidden layer. It is called hidden not because
it is hidden from the code, but because you do
not see any output coming out of these layers;
no target output comes out of them. The target
output only comes out of the last layer, which
is called the output layer.
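The flow described above, from a pixel vector through hidden layers to an output layer, can be sketched as a forward pass. This is a minimal sketch under assumptions: the layer sizes, the random weights, the sigmoid activation in the hidden layers and the softmax at the output are illustrative choices that the lecture has not specified at this point.

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def dense(x, layer):
    """Weighted sum plus bias for every neuron in the layer."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]

def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-z)) for z in values]

def softmax(values):
    m = max(values)
    exps = [math.exp(z - m) for z in values]
    total = sum(exps)
    return [e / total for e in exps]

def mlp_forward(image, hidden_layers, output_layer):
    x = [pixel for row in image for pixel in row]  # organize pixels into a vector
    for layer in hidden_layers:                    # hidden layers: no target output
        x = sigmoid(dense(x, layer))
    return softmax(dense(x, output_layer))         # output layer: label scores

rng = random.Random(0)
image = [[0.0, 0.5, 1.0], [0.2, 0.4, 0.6], [1.0, 0.0, 0.3]]  # toy 3x3 "image"
hidden_layers = [make_layer(9, 5, rng), make_layer(5, 4, rng)]
output_layer = make_layer(4, 2, rng)
probs = mlp_forward(image, hidden_layers, output_layer)
print(probs)
```

With untrained random weights the two output scores carry no meaning yet; the sketch only shows how the representations pass from layer to layer.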
The output layer and the input layer, to which
you give an input and from which you draw an
output, are called the visible layers, and all
of the intermediate layers through which the
whole operation and mathematics proceeds are
called the hidden layers. Once you have a
multi-layer perceptron, you will also have to
train it.
To train a multi-layer perceptron, let us use
the same analogy as in the first week lecture
on neural networks. You have your weight space
and you have your performance measure P, the
cost function J(w). It has the same troubles
you observed there: you can get multiple minima
and maxima, while the whole objective was that
you start at some random point, do gradient
descent, learn, and come down to a minimum.
But my point is that every time you are given a
different combination of initial weights, you
end up near a different minimum, and that is
where the challenge is.
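This dependence on the starting point can be seen even with a single weight. The cost function below is an assumed toy example with two separate minima, not the cost of any real network, and the learning rate and step count are arbitrary illustrative values.

```python
def cost(w):
    """Toy non-convex cost J(w) with two separate minima (assumed example)."""
    return w**4 - 3 * w**2 + w

def grad(w):
    """Derivative dJ/dw of the toy cost."""
    return 4 * w**3 - 6 * w + 1

def gradient_descent(w0, lr=0.01, steps=1000):
    """Plain gradient descent from the initial weight w0."""
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w

starts = [-2.0, -0.5, 0.5, 2.0]  # different initial weights
finals = [gradient_descent(w0) for w0 in starts]
for w0, w in zip(starts, finals):
    print(f"start {w0:+.1f} -> w = {w:+.3f}, J(w) = {cost(w):+.3f}")
```

The two negative starting points settle in one minimum and the two positive ones in the other, and the two minima have different cost values, so the initialization alone decides which solution is found.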
It turns out that this is not a very weird
thing; there is a reasoning as to why you get
that kind of behavior. If you look at this cost
function curve and take one of these minima,
and if you can get an explanation of what the
deep neural network was doing in a hierarchical
way, you would get one of these kinds of
models: take an image, extract features on each
of the small segments within that image, and
then do body part recognition. This is
essentially where it goes down a hierarchy: you
have to complete the layer of wavelets plus
LBPs plus HOG, and only then can you go to the
layer of body part recognition, and only once
that layer is done can you go to the layer of
the human appearance model.
That model is what associates itself with one
of these minima. If you start from a different
initialization, you will have a different model
doing the job, and so on. It turns out that for
every single distinct minimum you get a
different sort of model, and this is the major
reason why it is really difficult to analyze
and explain these multi-layer perceptrons.
With that, this lecture comes to an end. Here I
have just told you about the problem and the
issues which come with these kinds of models
with multiple layers. In the next one I will
discuss the different kinds of layers there can
be, and subsequently we will enter into the
math for solving these kinds of problems. So
wait and watch for the next one, on the history
of deep learning and learning with deep neural
networks. Till then, bye.
