[MUSIC PLAYING]
SPEAKER 1: All right.
Welcome back, everyone,
to an introduction
to Artificial Intelligence with Python.
Now last time, we took a look at
machine learning-- a set of techniques
that computers can use in
order to take a set of data
and learn some patterns inside of that
data, learn how to perform a task,
even if we, the programmers, didn't
give the computer explicit instructions
for how to perform that task.
Today, we transition to one of the most popular techniques and tools within machine learning, known as neural networks.
And neural networks were
inspired as early as the 1940s
by researchers who were thinking
about how it is that humans learn,
studying neuroscience
and the human brain,
and trying to see whether or not we can
apply those same ideas to computers as
well, and model computer
learning off of human learning.
So how is the brain structured?
Well, very simply put, the brain
consists of a whole bunch of neurons,
and those neurons are
connected to one another
and communicate with
one another in some way.
In particular, if you think about
the structure of a biological neural
network-- something like this--
there are a couple of key
properties that scientists observed.
One was that these neurons
are connected to each other
and receive electrical
signals from one another,
that one neuron can propagate
electrical signals to another neuron.
And another point is
that neurons process
those input signals, and then can
be activated, that a neuron becomes
activated at a certain point, and
then can propagate further signals
onto neurons in the future.
And so the question then became, could
we take this biological idea of how it
is that humans learn-- with
brains and with neurons--
and apply that to a
machine as well, in effect,
designing an artificial neural
network, or an ANN, which
will be a mathematical model
for learning that is inspired
by these biological neural networks?
And what artificial neural
networks will allow us to do
is they will first be able to model
some sort of mathematical function.
Every time you look at a neural network,
which we'll see more of later today,
each one of them is really
just some mathematical function
that is mapping certain
inputs to particular outputs,
based on the structure of
the network, that depending
on where we place particular units
inside of this neural network,
that's going to determine how it is
that the network is going to function.
And in particular,
artificial neural networks
are going to lend themselves
to a way that we can learn what
the network's parameters should be.
We'll see more on that in just a moment.
But in effect, we want a model such that it is easy for us to write some code that allows the network to figure out how to model the right mathematical function, given a particular set of input data.
So in order to create our
artificial neural network,
instead of using
biological neurons, we're
just going to use what
we're going to call units--
units inside of a neural network--
which we can represent kind
of like a node in a graph,
which will here be represented
just by a blue circle like this.
And these artificial units--
these artificial neurons--
can be connected to one another.
So here, for instance,
we have two units that
are connected by this edge inside
of this graph, effectively.
And so what we're going
to do now is think
of this idea as some sort of
mapping from inputs to outputs,
that we have one unit that
is connected to another unit,
that we might think of this side as the input and that side as the output.
And what we're trying to do then is
to figure out how to solve a problem,
how to model some sort
of mathematical function.
And this might take
the form of something
we saw last time, which
was something like, we
have certain inputs like variables
x1 and x2, and given those inputs,
we want to perform some sort of task--
a task like predicting whether
or not it's going to rain.
And ideally, we'd like some way,
given these inputs x1 and x2,
which stand for some sort of
variables to do with the weather,
we would like to be able
to predict, in this case,
a Boolean classification-- is it going
to rain, or is it not going to rain?
And we did this last time by
way of a mathematical function.
We defined some function h
for our hypothesis function
that took as input x1 and x2--
the two inputs that we cared
about processing-- in order
to determine whether we thought it was
going to rain, or whether we thought it
was not going to rain.
The question then becomes, what does
this hypothesis function do in order
to make that determination?
And we decided last time to use a linear
combination of these input variables
to determine what the output should be.
So our hypothesis function
was equal to something
like this: weight 0 plus weight 1
times x1 plus weight 2 times x2.
So what's going on here
is that x1 and x2--
those are input variables-- the
inputs to this hypothesis function--
and each of those input
variables is being
multiplied by some weight,
which is just some number.
So x1 is being multiplied by weight
1, x2 is being multiplied by weight 2,
and we have this additional
weight-- weight 0--
that doesn't get multiplied
by an input variable
at all, that just serves to either move
the function up or move the function's
value down.
You can think of this either as a weight that's just multiplied by some dummy value, like the number 1, so that multiplying by 1 doesn't change anything.
Or sometimes you'll
see in the literature,
people call this variable
weight 0 a "bias,"
so that you can think of these
variables as slightly different.
We have weights that are
multiplied by the input
and we separately add some
bias to the result as well.
You'll hear both of
those terminologies used
when people talk about neural
networks and machine learning.
So in effect, what we've
done here is that in order
to define a hypothesis
function, we just need
to decide and figure out
what these weights should be,
to determine what values to multiply by
our inputs to get some sort of result.
Of course, at the end
of this, what we need
to do is make some
sort of classification
like raining or not raining, and to
do that, we use some sort of function
to define some sort of threshold.
And so we saw, for instance, the
step function, which is defined as 1
if the result of multiplying the
weights by the inputs is at least 0;
otherwise as 0.
You can think of this line
down the middle-- it's kind
of like a dotted line.
Effectively, it stays at 0
all the way up to one point,
and then the function steps--
or jumps up-- to 1.
So it's zero before it
reaches some threshold,
and then it's 1 after it
reaches a particular threshold.
And so this was one way
we could define what
we'll come to call an "activation
function," a function that
determines when it is that
this output becomes active--
changes to a 1 instead of being a 0.
But we also saw that if we didn't just
want a purely binary classification,
if we didn't want purely
1 or 0, but we wanted
to allow for some in-between
real number values,
we could use a different function.
And there are a number of choices,
but the one that we looked at was
the logistic sigmoid function that
has sort of an S-shaped curve,
where we could represent
this as a probability--
that maybe somewhere in between, the probability of rain is something like 0.5, and maybe a little bit later the probability of rain is 0.8--
and so rather than just have a
binary classification of 0 or 1,
we can allow for numbers
that are in between as well.
And it turns out there are
many other different types
of activation functions, where
an activation function just
takes the output of multiplying the
weights together and adding that bias,
and then figuring out what
the actual output should be.
Another popular one is the rectified linear unit, otherwise known as ReLU, and the way that works is that it just takes its input and takes the maximum of that input and 0. So if it's positive, it remains unchanged, but if it's negative, it goes ahead and levels out at 0.
And there are other activation
functions that we can choose as well.
But in short, each of
these activation functions,
you can just think of as a function
that gets applied to the result of all
of this computation.
We take some function g and apply it to
the result of all of that calculation.
And this then is what we saw
last time-- the way of defining
some hypothesis function
that takes on inputs,
calculates some linear
combination of those inputs,
and then passes it through some sort of
activation function to get our output.
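As a minimal sketch of that idea in Python (the function and variable names here are just for illustration), the hypothesis is a weighted sum passed through whichever activation function we choose:

```python
# A minimal sketch of the hypothesis h(x1, x2) = g(w0 + w1*x1 + w2*x2),
# where g is whichever activation function we choose. Names are illustrative.
import math

def step(z):
    # Outputs 1 once the weighted sum reaches the threshold of 0, otherwise 0.
    return 1 if z >= 0 else 0

def sigmoid(z):
    # The logistic sigmoid: an S-shaped curve giving a value between 0 and 1.
    return 1 / (1 + math.exp(-z))

def relu(z):
    # Rectified linear unit: the maximum of the input and 0.
    return max(0, z)

def hypothesis(x1, x2, w0, w1, w2, g=step):
    return g(w0 + w1 * x1 + w2 * x2)
```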
And this actually turns
out to be the model
for the simplest of neural
networks, that we're
going to instead represent this
mathematical idea graphically, by using
a structure like this.
Here then is a neural
network that has two inputs.
We can think of this
as x1 and this as x2.
And then one output, which you can
think of classifying whether or not
we think it's going to rain
or not rain, for example,
in this particular instance.
And so how exactly does this model work?
Well, each of these two inputs
represents one of our input variables--
x1 and x2.
And notice that these
inputs are connected
to this output via
these edges, which are
going to be defined by their weights.
So these edges each have a
weight associated with them--
weight 1 and weight 2--
and then this output unit,
what it's going to do
is it is going to calculate an
output based on those inputs
and based on those weights.
This output unit is going to multiply
all the inputs by their weights,
add in this bias term, which you can
think of as an extra w0 term that
gets added into it, and then we pass
it through an activation function.
So this then is just a graphical
way of representing the same idea
we saw last time, just mathematically.
And we're going to call this
a very simple neural network.
And we'd like for this
neural network to be
able to learn how to
calculate some function,
that we want some function for
the neural network to learn,
and the neural network
is going to learn what
should the values of w0, w1, and w2 be.
What should the activation
function be in order
to get the result that we would expect?
So we can actually take a
look at an example of this.
What then is a very simple
function that we might calculate?
Well, if we recall back from when we
were looking at propositional logic,
one of the simplest
functions we looked at
was something like the or
function, that takes two inputs--
x and y-- and outputs 1, otherwise known
as true, if either one of the inputs,
or both of them, are 1, and outputs a 0
if both of the inputs are 0, or false.
So this then is the or function.
And this was the truth table for
the or function-- that as long
as either of the inputs are 1,
the output of the function is 1,
and the only case where the output of
0 is where both of the inputs are 0.
So the question is, how could we take
this and train a neural network to be
able to learn this particular function?
What would those weights look like?
Well, we could do something like this.
Here's our neural
network, and I'll propose
that in order to
calculate the or function,
we're going to use a value
of 1 for each of the weights,
and we'll use a bias
of negative 1, and then
we'll just use this step function
as our activation function.
How then does this work?
Well, if I wanted to calculate
something like 0 or 0,
which we know to be 0, because
false or false is false, then
what are we going to do?
Well, our output unit
is going to calculate
this input multiplied by the weight.
0 times 1, that's 0.
Same thing here.
0 times 1, that's 0.
And we'll add to that the bias, minus 1.
So that'll give us some
result of negative 1.
If we plot that on our activation
function-- negative 1 is here--
it's before the threshold, which means the output is 0; it's only 1 once we reach the threshold. Since negative 1 is before the threshold, the output that this unit provides is going to be 0.
And that's what we would expect
it to be, that 0 or 0 should be 0.
What if instead we had had 1 or
0, where this is the number 1?
Well, in this case,
in order to calculate
what the output is going to be, we
again have to do this weighted sum.
1 times 1, that's 1.
0 times 1, that's 0.
Sum of that so far is 1.
Add negative 1 to that.
Well, then we get an output of 0.
And if we plot 0 on the step
function, 0 ends up being here--
it's just at the threshold--
and so the output here
is going to be 1, because the
output of 1 or 0, that's 1.
So that's what we would expect as well.
And just for one more example, if I
had 1 or 1, what would the result be?
Well 1 times 1 is 1.
1 times 1 is 1.
The sum of those is 2.
I add the bias term to that.
I get the number 1.
1 plotted on this graph
is way over there.
That's well beyond the threshold.
And so this output is
going to be 1 as well.
The output is always 0 or 1,
depending on whether or not
we're past the threshold.
And this neural network then models the
or function-- a very simple function,
definitely-- but it still is
able to model it correctly.
If I give it the inputs, it will
tell me what x1 or x2 happens to be.
And you could imagine trying
to do this for other functions
as well-- a function like the
and function, for instance,
that takes two inputs and calculates
whether both x and y are true.
So if x is 1 and y is 1, then
the output of x and y is 1,
but in all of the other
cases, the output is 0.
How could we model that inside
of a neural network as well?
Well, it turns out we could do it in the
same way, except instead of negative 1
as the bias, we can use
negative 2 as the bias instead.
What does that end up looking like?
Well, if I had 1 and 1, that should be 1, because true and true is equal to true.
Well, I take 1 times 1.
That's 1.
1 times 1 is 1.
I got a total sum of 2 so far.
Now I add the bias of negative
2, and I get the value 0.
And 0, when I plot it on the activation function, is just at that threshold.
And so the output is going to be 1.
But if I had any other input,
for example, like 1 and 0, well,
the weighted sum of these is 1 plus 0.
It's going to be 1.
Minus 2 is going to give us
negative 1, and negative 1
is not past that threshold, and
so the output is going to be zero.
So those then are some
very simple functions
that we can model using a neural
network, that has two inputs and one
output, where our goal is
to be able to figure out
what those weights should be in order
to determine what the output should be.
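To make that concrete, here is a minimal sketch of such a unit in Python, using the step activation function with the weights described above-- weights of 1 and a bias of -1 for or, and the same weights with a bias of -2 for and:

```python
# A tiny perceptron-style unit with a step activation function, using the
# weights from the lecture's examples: bias -1 models OR, bias -2 models AND.
def step(z):
    return 1 if z >= 0 else 0

def unit(x1, x2, w1, w2, bias):
    return step(w1 * x1 + w2 * x2 + bias)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(f"{x1} or {x2}  = {unit(x1, x2, 1, 1, -1)}")
    print(f"{x1} and {x2} = {unit(x1, x2, 1, 1, -2)}")
```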
And you could imagine generalizing this
to calculate more complex functions as
well, that maybe given the
humidity and the pressure,
we want to calculate what's the
probability that it's going to rain,
for example.
Or you might want to do a
regression-style problem, where
given some amount of advertising
and given what month it is maybe,
we want to predict what
our expected sales are
going to be for that particular month.
So you could imagine these inputs
and outputs being different as well.
And it turns out that in some problems,
we're not just going to have two
inputs, and the nice thing about these
neural networks is that we can compose
multiple units together-- make
our networks more complex--
just by adding more units into
this particular neural network.
So the network we've been looking
at has two inputs and one output.
But we could just as
easily say, let's go ahead
and have three inputs in there,
or have even more inputs,
where we could arbitrarily decide how many inputs there are to our problem, all of which are going to be used in calculating some sort of output that we care about figuring out the value of.
How then does the math work
for figuring out that output?
Well, it's going to work
in a very similar way.
In the case of two inputs, we had
two weights indicated by these edges,
and we multiplied the weights by
the numbers, adding this bias term,
and we'll do the same thing
in the other cases as well.
If I have three inputs, you'll imagine
multiplying each of these three inputs
by each of these weights.
If I had five inputs instead,
we're going to do the same thing.
Here, I'm saying, sum up, for i from 1 to 5, xi multiplied by weight i.
So take each of the
five input variables,
multiply them by their corresponding
weight, and then add the bias to that.
So this would be a case where there are
five inputs into this neural network,
for example.
But there could be arbitrarily many nodes that we want inside of this neural network,
where each time we're
just going to sum up
all of those input variables
multiplied by the weight,
and then add the bias
term at the very end.
And so this allows us
to be able to represent
problems that have even
more inputs, just by growing
the size of our neural network.
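A quick sketch of that generalized weighted sum, with arbitrary illustrative numbers:

```python
# A single unit with arbitrarily many inputs: multiply each input by its
# corresponding weight, sum them up, add the bias at the end, and pass the
# total through an activation function g.
def unit_output(inputs, weights, bias, g):
    total = bias
    for x, w in zip(inputs, weights):
        total += x * w
    return g(total)

# For example, five inputs with five corresponding weights (made-up numbers)
# and a step activation function.
print(unit_output([1, 0, 1, 1, 0], [0.5, 0.2, 0.9, 0.1, 0.3],
                  bias=-1, g=lambda z: 1 if z >= 0 else 0))
```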
Now, the next question we might ask is, how is it that we train these neural networks?
In the case of the or
function and the and function,
they were simple enough
functions that I could just tell you what the weights should be,
and you could probably
reason through it yourself
what the weights should be in order
to calculate the output that you want.
But in general, with functions
like predicting sales or predicting
whether or not it's going to
rain, these are much trickier
functions to be able to figure out.
We would like the computer to have some
mechanism of calculating what it is
that the weights should be--
how it is to set the weights--
so that our neural network is able
to accurately model the function
that we care about trying to estimate.
And it turns out that the
strategy for doing this,
inspired by the domain of calculus, is
a technique called gradient descent.
And what gradient descent is, it
is an algorithm for minimizing loss
when you're training a neural network.
And recall that loss refers to how bad
our hypothesis function happens to be,
that we can define
certain loss functions,
and we saw some examples
of loss functions
last time that just give us a number
for any particular hypothesis,
saying how poorly does
it model the data?
How many examples does it get wrong?
How much worse or better is it as compared to other hypothesis functions that we might define?
And this loss function is
just a mathematical function,
and when you have a
mathematical function,
in calculus, what you
could do is calculate
something known as the gradient, which you can think of as like a slope.
It's the direction the loss function
is moving at any particular point.
And what it's going to tell
us is in which direction
should we be moving these weights in
order to minimize the amount of loss?
And so generally speaking-- we
won't get into the calculus of it--
but the high-level idea
for gradient descent
is going to look something like this.
If we want to train a
neural network, we'll
go ahead and start just by
choosing the weights randomly.
Just pick random weights for all of
the weights in the neural network.
And then we'll use the input data
that we have access to in order
to train the network
in order to figure out
what the weights should actually be.
So we'll repeat this
process again and again.
The first step is we're going
to calculate the gradient based
on all of the data points.
So we'll look at all the data and figure
out what the gradient is at the place
where we currently are-- for the
current setting of the weights--
which means, in which direction should we move the weights in order to minimize the total amount of loss and make our solution better?
And once we've calculated
that gradient--
which direction we should
move in the loss function--
well, then we can just update those
weights according to the gradient,
take a small step in the
direction of those weights
in order to try to make our
solution a little bit better.
And the size of the step that
we take, that's going to vary,
and you can choose that when you're
training a particular neural network.
But in short, the idea is going
to be take all of the data points,
figure out based on those data points in
what direction the weights should move,
and then move the weights one
small step in that direction.
And if you repeat that
process over and over again,
adjusting the weights a little bit at
a time based on all the data points,
eventually, you should end up with
a pretty good solution to trying
to solve this sort of problem.
At least that's what we
would hope to happen.
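As a rough, self-contained sketch of that loop (here the gradient is estimated numerically on a tiny made-up dataset, just to keep the example concrete; a real library computes it exactly with calculus):

```python
# A rough sketch of gradient descent fitting a simple linear unit to some data.
import random

def loss(weights, data):
    # Squared error of the prediction w0 + w1*x against the true value y.
    w0, w1 = weights
    return sum((w0 + w1 * x - y) ** 2 for x, y in data)

def gradient(weights, data, eps=1e-6):
    # Numerically estimate the slope of the loss with respect to each weight.
    grads = []
    for i in range(len(weights)):
        bumped = list(weights)
        bumped[i] += eps
        grads.append((loss(bumped, data) - loss(weights, data)) / eps)
    return grads

data = [(0, 1), (1, 3), (2, 5), (3, 7)]          # made-up points along y = 2x + 1
weights = [random.random(), random.random()]      # start with random weights
for _ in range(1000):
    g = gradient(weights, data)                   # gradient based on all data points
    weights = [w - 0.01 * gi for w, gi in zip(weights, g)]  # take a small step
print(weights)  # should end up close to [1, 2]
```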
Now as you look at this
algorithm, a good question
to ask anytime you're
analyzing an algorithm
is, what is going to be the expensive
part of doing the calculation?
What's going to take a lot of work to figure out-- what is going to be expensive to calculate?
And in particular, in the
case of gradient descent,
the really expensive part is this
all data points part right here,
having to take all of the data
points and using all of those data
points to figure out what the gradient
is at this particular setting of all
of the weights, because odds are,
in a big machine learning problem
where you're trying to solve a
big problem with a lot of data,
you have a lot of data points to work with,
and figuring out the gradient
based on all of those data points
is going to be expensive.
And you'll have to do it many times,
but you'll likely repeat this process
again and again and again, going
through all the data points,
taking one small step over and
over, as you try and figure
out what the optimal setting
of those weights happens to be.
It turns out that we
would ideally like to be
able to train our neural networks faster
to be able to more quickly converge
to some sort of solution that is going
to be a good solution to the problem.
So in that case, there are alternatives
to just standard gradient descent,
which looks at all of
the data points at once.
We can employ a method like stochastic
gradient descent, which will randomly
just choose one data point at a time
to calculate the gradient based on,
instead of calculating it based
on all of the data points.
So the idea there is that we
have some setting of the weights,
we pick a data point, and
based on that one data point,
we figure out in which direction
should we move all of the weights,
and move the weights in that small
direction, then take another data point
and do that again, and repeat
this process again and again,
maybe looking at each of the
data points multiple times,
but each time, only using one data
point to calculate the gradient
to calculate which
direction we should move in.
Now just using one data point
instead of all of the data points
probably gives us a
less accurate estimate
of what the gradient actually is.
But on the plus side, it's going to be
much faster to be able to calculate,
that we can much more quickly calculate
what the gradient is, based on one data
point, instead of calculating
based on all of the data points
and having to do all of that
computational work again and again.
So there are trade-offs here between
looking at all of the data points
and just looking at one data point.
And it turns out that a middle ground--
and this is also quite popular--
is a technique called
mini-batch gradient descent,
where the idea there is, instead of looking at all of the data or just a single point, we instead divide our dataset up into small batches--
groups of data points-- where you can
decide how big a particular batch is,
but in short, you're just going to
look at a small number of points
at any given time, hopefully getting a
more accurate estimate of the gradient,
but also not requiring all of
the computational effort needed
to look at every single
one of these data points.
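Sketched very roughly, the only thing that changes between these variants is how much data feeds into each gradient calculation:

```python
# Sketch of the data each variant uses per update. The dataset is a stand-in.
import random

data = list(range(100))   # stand-in for 100 data points

# Standard gradient descent: every update looks at all of the data.
all_points = data

# Stochastic gradient descent: each update looks at one random data point.
one_point = [random.choice(data)]

# Mini-batch gradient descent: shuffle the data and walk through it in
# small groups (here, batches of 10 points at a time).
random.shuffle(data)
batches = [data[i:i + 10] for i in range(0, len(data), 10)]
print(len(all_points), len(one_point), [len(b) for b in batches])
```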
So gradient descent
then is this technique
that we can use in order to train
these neural networks in order
to figure out what the setting
of all of these weights
should be, if we want some way to try
and get an accurate notion of how it is
that this function should work, some
way of modeling how to transform
the inputs into particular outputs.
So far, the networks that
we've taken a look at
have all been structured
similar to this.
We have some number of inputs--
maybe two or three or five or more--
and then we have one output that is
just predicting like rain or no rain,
or just predicting one particular value.
But often in machine learning problems,
we don't just care about one output.
We might care about an output that has
multiple different values associated
with it.
So in the same way that we
could take a neural network
and add units to the input layer,
we can likewise add outputs
to the output layer as well.
Instead of just one output, you
could imagine we have two outputs,
or we could have like four outputs,
for example, where in each case,
as we add more inputs
or add more outputs,
if we want to keep this network fully
connected between these two layers,
we just need to add more weights,
that now each of these input nodes has four weights, one associated with each of the four outputs,
and that's true for each of these
various different input nodes.
So as we add nodes, we
add more weights in order
to make sure that each
of the inputs can somehow
be connected to each of the
outputs, so that each output
value can be calculated based on what
the value of the input happens to be.
So what might a case be where we want
multiple different output values?
Well, you might consider
that in the case of weather
predicting, for example,
we might not just care
whether it's raining or not raining.
There might be multiple
different categories of weather
that we would like to
categorize the weather into.
With just a single output variable,
we can do a binary classification,
like rain or no rain, for instance--
1 or 0-- but it doesn't allow
us to do much more than that.
With multiple output
variables, I might be
able to use each one to predict
something a little different.
Maybe I want to categorize
the weather into one
of four different
categories, something like,
is it going to be raining
or sunny or cloudy or snowy,
and I now have four
output variables that
can be used to represent maybe the
probability that it is raining,
as opposed to sunny, as opposed
to cloudy, or as opposed to snowy.
How then would this neural network work?
Well, we have some input
variables that represent some data
that we have collected
about the weather.
Each of those inputs
gets multiplied by each
of these various different weights.
We have more multiplications
to do, but these
are fairly quick mathematical
operations to perform.
And then what we get is, after passing them through some sort of activation function at the outputs, we end up getting some sort of number,
where that number, you might imagine,
you can interpret as like a
probability, like a probability
that it is one category, as
opposed to another category.
So here we're saying
that based on the inputs,
we think there is a 10% chance that it's
raining, a 60% chance that it's sunny,
a 20% chance that it's cloudy, and a 10% chance that it's snowy.
And given that output, if these
represent a probability distribution,
well, then you could just pick
whichever one has the highest value--
in this case, sunny--
and say that, well,
most likely, we think
that this categorization of inputs
means that the output should be sunny,
and that is what we
would expect the weather
to be in this particular instance.
So this allows us to do these sort
of multi-class classifications,
where instead of just having
a binary classification--
1 or 0-- we can have as many
different categories as we
want, and we can have our neural
network output these probabilities
over which categories are more likely than other categories,
and using that data, we're able
to draw some sort of inference
on what it is that we should do.
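Picking the category from that kind of output is then just a matter of taking whichever value is largest-- a tiny sketch, with the same illustrative numbers:

```python
# Choose the category with the highest output probability.
# The probabilities here are illustrative, matching the weather example.
probabilities = {"rain": 0.1, "sun": 0.6, "cloudy": 0.2, "snow": 0.1}
prediction = max(probabilities, key=probabilities.get)
print(prediction)  # "sun"
```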
So this was sort of the idea
of supervised machine learning.
I can give this neural network
a whole bunch of data--
whole bunch of input data--
corresponding to some
label, some output data--
like we know that it
was raining on this day,
we know that it was sunny on that day--
and using all of that
data, the algorithm
can use gradient descent to
figure out what all of the weights
should be in order to create
some sort of model that
hopefully allows us
a way to predict what
we think the weather is going to be.
But neural networks have a lot
of other applications as well.
You can imagine applying
the same sort of idea
to a reinforcement learning
sort of example as well.
Well, you'll remember that in reinforcement learning, what we wanted to do is train some sort of agent to learn what action to take depending on what state it currently happens to be in.
So depending on the
current state of the world,
we wanted the agent to pick from among the actions that are available to it.
And you might model that by having
each of these input variables
represent some information
about the state--
some data about what state
our agent is currently in--
and then the output,
for example, could be
each of the various different
actions that our agent could
take-- action 1, 2, 3,
and 4, and you might
imagine that this network
would work in the same way,
that based on these
particular inputs we go ahead
and calculate values for
each of these outputs,
and those outputs could model which
action is better than other actions,
and we could just choose, based
on looking at those outputs, which
actions we should take.
And so these neural networks
are very broadly applicable,
that all they're really doing is
modeling some mathematical function.
So anything that we can frame as
a mathematical function, something
like classifying inputs into
various different categories,
or figuring out based
on some input state what
action we should take-- these are all
mathematical functions that we could
attempt to model by taking advantage
of this neural network structure,
and in particular, taking advantage
of this technique, gradient descent,
that we can use in order to figure out
what the weights should be in order
to do this sort of calculation.
Now how is it that you would go about
training a neural network that has
multiple outputs instead of just one?
Well, with just a
single output, we could
see what the output for
that value should be,
and then update all of the weights that correspond to it.
And when we have multiple outputs,
at least in this particular case,
we can really think of this as
four separate neural networks,
that really we just
have one network here
that has these three inputs, connected by these three weights to this one output value.
And the same thing is true
for this output value.
This output value effectively
defines yet another neural network
that has these same three inputs,
but a different set of weights
that correspond to this output.
And likewise, this output has
its own set of weights as well,
and the same thing for
the fourth output too.
And so if you wanted to train a neural
network that had four outputs instead
of just one, in this case where
the inputs are directly connected
to the outputs, you could
really think of this
as just training four
independent neural networks.
We know what the outputs
for each of these four
should be based on our input
data, and using that data,
we can begin to figure out what all
of these individual weights should be,
and maybe there's an additional step at the end to turn these values into a probability distribution,
such that we can interpret
which one is better than another
or more likely than another as a
category or something like that.
So this then seems like it does a pretty
good job of taking inputs and trying
to predict what outputs
should be, and we'll
see some real examples of
this in just a moment as well.
But it's important then to think about what the limitations of this sort of approach are,
of just taking some linear
combination of inputs
and passing it into some
sort of activation function.
And it turns out that when we do this
in the case of binary classification--
I'm trying to predict like does it
belong to one category or another--
we can only predict things that are
linearly separable, because we're
taking a linear combination of inputs
and using that to define some decision
boundary or threshold.
Then what we get is a situation
where if we have this set of data,
we can predict a line
that separates linearly
the red points from the blue points.
But a single unit that is
making a binary classification,
otherwise known as a perceptron,
can't deal with a situation like this,
where-- we've seen this
type of situation before--
where there is no
straight line that just
goes straight through the data that
will divide the red points away
from the blue points.
It's a more complex decision boundary.
The decision boundary somehow
needs to capture the things
inside of the circle, and
there isn't really a line
that will allow us to deal with that.
So this is the limitation
of the perceptron--
these units that just make these binary
decisions based on their inputs--
that a single perceptron
is only capable of learning
a linearly separable decision boundary.
All it can do is define a line.
And sure, it can give
us probabilities based
on how close to that
decision boundary we are,
but it can only really decide based
on a linear decision boundary.
And so this doesn't seem like it's
going to generalize well to situations
where real-world data is involved,
because real-world data often
isn't linearly separable.
It often isn't the case that we can
just draw a line through the data
and be able to divide it
up into multiple groups.
So what then is the solution to this?
Well, what was proposed was the
idea of a multilayer neural network,
that so far, all of the neural networks
we've seen have had a set of inputs
and a set of outputs, and the inputs
are connected to those outputs.
But in a multi-layer neural network,
this is going to be an artificial
neural network that has an input
layer still, it has an output layer,
but also has one or more
hidden layers in between--
other layers of artificial
neurons, or units, that
are going to calculate
their own values as well.
So instead of a neural
network that looks like this,
with three inputs and one output, you
might imagine, in the middle here,
injecting a hidden layer--
something like this.
This is a hidden layer
that has four nodes.
You could choose how many nodes or units
end up going into the hidden layer,
and you could have multiple hidden layers as well.
And so now each of these inputs isn't
directly connected to the output.
Each of the inputs is connected
to this hidden layer, and then
all of the nodes in the hidden layer,
those are connected to the one output.
And so this is just
another step that we can
take towards calculating
more complex functions.
Each of these hidden units will
calculate its output value,
otherwise known as its activation,
based on a linear combination
of all the inputs.
And once we have values
for all of these nodes,
as opposed to this just being the
output, we do the same thing again--
calculate the output
for this node, based
on multiplying each of the values for
these units by their weights as well.
So in effect, the way this works
is that we start with inputs.
They get multiplied by
weights in order to calculate
values for the hidden nodes.
Those get multiplied by weights
in order to figure out what
the ultimate output is going to be.
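A small sketch of that forward pass, with arbitrary made-up weights just to show the flow of computation:

```python
# Forward pass through a network with one hidden layer: inputs get multiplied
# by weights to compute each hidden unit's activation, and those activations
# get multiplied by another set of weights to compute the output.
def relu(z):
    return max(0, z)

def unit(inputs, weights, bias, g):
    return g(sum(x * w for x, w in zip(inputs, weights)) + bias)

inputs = [0.5, 0.8, 0.1]                       # three made-up input values
hidden_layer = [([0.2, 0.4, 0.6], 0.1),        # (weights, bias) for each of
                ([0.7, 0.1, 0.3], -0.2),       # four hidden units
                ([0.5, 0.5, 0.5], 0.0),
                ([0.9, 0.2, 0.4], 0.3)]
output_weights, output_bias = [0.3, 0.6, 0.1, 0.8], -0.5

hidden = [unit(inputs, w, b, relu) for w, b in hidden_layer]
output = unit(hidden, output_weights, output_bias, relu)
print(hidden, output)
```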
And the advantage of layering things
like this is it gives us an ability
to model more complex functions,
that instead of just having a single
decision boundary-- a single line
dividing the red points from the blue
points--
each of these hidden nodes can
learn a different decision boundary,
and we can combine those decision
boundaries to figure out what
the ultimate output is going to be.
And as we begin to imagine
more complex situations,
you could imagine each of these
nodes learning some useful property
or learning some useful
feature of all of the inputs
and somehow learning how to combine
those features together in order to get
the output that we actually want.
Now the natural question, when
we begin to look at this now,
is to ask the question of, how
do we train a neural network
that has hidden layers inside of it?
And this turns out to initially
be a bit of a tricky question,
because the input data we are given
is we are given values for all
of the inputs, and we're given what
the value of the output should be--
what the category is, for example--
but the input data doesn't tell us
what the values for all of these nodes
should be.
So we don't know how far
off each of these nodes
actually is, because we're only given
data for the inputs and the outputs.
The reason this is
called the hidden layer
is because the data that
is made available to us
doesn't tell us what the values
for all of these intermediate nodes
should actually be.
And so the strategy people came up with was to say that if you know what the error or the loss is on the output node, well, then based on what these weights are-- if one of these weights is higher than another-- you can calculate an estimate for how much of the error from the output was due to this part of the hidden layer, or that part of the hidden layer, based on the values of these weights, in effect saying that, based on the error from the output, I can backpropagate the error and figure out an estimate for what the error is for each of these nodes in the hidden layer as well.
And there's some more calculus here
that we won't get into the details of,
but the idea of this algorithm
is known as backpropagation.
It's an algorithm for
training a neural network
with multiple different hidden layers.
And the idea for this--
the pseudocode for it--
will again be, if we want to run
gradient descent with backpropagation,
we'll start with a random choice
of weights as we did before,
and now we'll go ahead and repeat
the training process again and again.
But what we're going
to do each time is now
we're going to calculate the
error for the output layer first.
We know the output and what it should
be, and we know what we calculated,
so we figure out what
the error there is.
But then we're going to
repeat, for every layer,
starting with the output layer,
moving back into the hidden layer,
then the hidden layer before that
if there are multiple hidden layers,
going back all the way to
the very first hidden layer,
assuming there are multiple, we're going
to propagate the error back one layer--
whatever the error was from the output--
figure out what the error should be
a layer before that based on what
the values of those weights are.
And then we can update those weights.
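Here is a minimal, self-contained sketch of that idea for a tiny network with one hidden layer, trained on the or function just to keep it small-- the learning rate and structure here are illustrative, and in practice a library handles all of this for us:

```python
# A tiny sketch of gradient descent with backpropagation: two inputs, two
# hidden units, one output, all using the sigmoid activation function.
import math, random

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]  # the OR function

# Start with a random choice of weights for both layers.
w1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
b1 = [random.uniform(-1, 1) for _ in range(2)]
w2 = [random.uniform(-1, 1) for _ in range(2)]
b2 = random.uniform(-1, 1)
lr = 0.5  # size of the step we take each time

for _ in range(10000):
    for x, y in data:
        # Forward pass: compute hidden activations, then the output.
        h = [sigmoid(sum(w1[j][i] * x[i] for i in range(2)) + b1[j]) for j in range(2)]
        o = sigmoid(sum(w2[j] * h[j] for j in range(2)) + b2)

        # Backward pass: error at the output layer first...
        d_out = (o - y) * o * (1 - o)
        # ...then propagate that error back to the hidden layer via the weights.
        d_hidden = [d_out * w2[j] * h[j] * (1 - h[j]) for j in range(2)]

        # Update the weights, taking a small step against the gradient.
        for j in range(2):
            w2[j] -= lr * d_out * h[j]
            for i in range(2):
                w1[j][i] -= lr * d_hidden[j] * x[i]
            b1[j] -= lr * d_hidden[j]
        b2 -= lr * d_out

# After training, the outputs should hopefully be close to the OR function.
for x, y in data:
    h = [sigmoid(sum(w1[j][i] * x[i] for i in range(2)) + b1[j]) for j in range(2)]
    print(x, round(sigmoid(sum(w2[j] * h[j] for j in range(2)) + b2), 2), "expected", y)
```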
So graphically, the way
you might think about this
is that we first start with the output.
We know what the output should be.
We know what output we calculated.
And based on that, we can
figure out, all right,
how do we need to update
those weights, backpropagating
the error to these nodes.
And using that, we can figure out
how we should update these weights.
And you might imagine if
there are multiple layers,
we could repeat this
process again and again
to begin to figure out how all of
these weights should be updated.
And this backpropagation
algorithm is really
the key algorithm that makes
neural networks possible,
and makes it possible to take
these multi-level structures
and be able to train those
structures, depending
on what the values of these
weights are in order to figure out
how it is that we should go about
updating those weights in order
to create some function that is able
to minimize the total amount of loss,
to figure out some good setting of
the weights that will take the inputs
and translate it into the
output that we expect.
And this works, as we said, not
just for a single hidden layer,
but you can imagine multiple hidden
layers, where each hidden layer--
we just defined however
many nodes we want--
where each of the nodes
in one layer, we can
connect to the nodes in the next
layer, defining more and more complex
networks that are able to model more
and more complex types of functions.
And so this type of network is what we
might call a deep neural network, part
of a larger family of
deep learning algorithms,
if you've ever heard that term.
And all deep learning is about is
it's using multiple layers to be
able to predict and be able to
model higher-level features inside
of the input, to be able to figure
out what the output should be.
And so the deep neural network
is just a neural network that
has multiple of these hidden
layers, where we start at the input,
calculate values for this layer,
then this layer, then this layer,
and then ultimately get an output.
And this allows us to be able to
model more and more sophisticated
types of functions, that
each of these layers
can calculate something
a little bit different.
And we can combine that information to
figure out what the output should be.
Of course, as with any
situation of machine learning,
as we begin to make our
models more and more complex,
to model more and more complex
functions, the risk we run
is something like overfitting.
And we talked about overfitting last time in the context of training our models to learn some sort of decision boundary, where overfitting happens when we fit too closely to the training data, and as a result, we don't generalize well to other situations.
And one of the risks we run with a
far more complex neural network that
has many, many different
nodes is that we
might overfit based on the input data; we might grow over-reliant on certain nodes to calculate things purely based on the training data, in a way that doesn't allow us to generalize very well to new data.
And there are a number of strategies
for dealing with overfitting,
but one of the most popular in
the context of neural networks
is a technique known as dropout.
And what dropout does is, when we're training the neural network, temporarily remove units-- temporarily remove these artificial neurons from our network, chosen at random-- and the goal here is to prevent over-reliance on certain units.
So what generally
happens in overfitting is
that we begin to over-rely on certain
units inside the neural network
to be able to tell us how
to interpret the input data.
What dropout will do is randomly
remove some of these units
in order to reduce the chance that
we over-rely on certain units,
to make our neural
network more robust, to be
able to handle the situations even when
we just drop out particular neurons
entirely.
So the way that might work is
we have a network like this,
and as we're training it,
when we go about trying
to update the weights
the first time, we'll
just randomly pick some percentage of
the nodes to drop out of the network.
It's as if those nodes
aren't there at all.
It's as if the weights associated
with those nodes aren't there at all.
And we'll train in this way.
Then the next time we update the
weights, we'll pick a different set
and just go ahead and train that
way, and then again randomly choose and train with other nodes that have been dropped out as well.
And the goal of that is that
after the training process,
if you train by dropping out random
nodes inside of this neural network,
you hopefully end up with a network
that's a little bit more robust, that
doesn't rely too heavily
on any one particular node,
but more generally learns how to
approximate a function in general.
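In a library like TensorFlow's Keras API, dropout is typically added as its own layer; a sketch along these lines (the rate of 0.5 is just an example) would randomly ignore half of the hidden units during each training update:

```python
# Sketch: adding dropout to a Keras model. During training, the Dropout layer
# randomly ignores 50% of the previous layer's units on each update;
# at evaluation time, all units are used.
import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(8, input_shape=(4,), activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid")
])
```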
So that then is a look at
some of these techniques
that we can use in order to
implement a neural network, to get
at the idea of taking
this input, passing it
through these various
different layers, in order
to produce some sort of output.
And what we'd like to do now is take
those ideas and put them into code.
And to do that, there are a number
of different machine learning
libraries-- neural network
libraries-- that we can use that
allow us to get access to someone's
implementation of backpropagation
and all of these hidden layers.
And one of the most popular,
developed by Google,
is known as TensorFlow,
a library that we
can use for quickly
creating neural networks
and modeling them and running
them on some sample data
to see what the output is going to be.
And before we actually
start writing code,
we'll go ahead and take a look
at TensorFlow's Playground, which
will be an opportunity
for us just to play around
with this idea of neural
networks in different layers,
just to get a sense for what it is
that we can do by taking advantage
of neural networks.
So let's go ahead and go into
TensorFlow's Playground, which you can
go to by visiting that URL from before.
And what we're going to do now is we're
going to try and learn the decision
boundary for this particular output.
I want to learn to separate the
orange points from the blue points,
and I'd like to learn some
sort of setting of weights
inside of a neural network that will be
able to separate those from each other.
The features we have
access to, our input data,
are the x value and the y value, so the
two values along each of the two axes.
And what I'll do now is I can set
particular parameters, like what
activation function I would like
to use, and I'll just go ahead
and press Play and see what happens.
And what happens here is that you'll
see that just by using these two input
features-- the x value and the
y value, with no hidden layers--
just take the input, x and y values, and
figure out what the decision boundary
is--
our neural network learns
pretty quickly that in order
to divide these two points,
we should just use this line.
This line acts as the decision boundary
that separates this group of points
from that group of points,
and it does it very well.
You can see up here what the loss is.
The training loss is
zero, meaning we were
able to perfectly model separating
these two points from each other inside
of our training data.
So this was a fairly simple case of
trying to apply a neural network,
because the data is very clean; it's very nicely linearly separable.
We can just draw a line that separates
all of those points from each other.
Let's now consider a more complex case.
So I'll go ahead and
pause the simulation,
and we'll go ahead and
look at this data set here.
This data set is a little
bit more complex now.
In this data set, we still
have blue and orange points
that we'd like to
separate from each other,
but there is no single
line that we can draw
that is going to be able to
figure out how to separate
the blue from the orange, because the
blue is located in these two quadrants
and the orange is located here and here.
It's a more complex function
to be able to learn.
So let's see what happens if we just
try and predict based on those inputs--
the x- and y-coordinates--
what the output should be.
Press Play, and what you'll notice
is that we're not really able
to draw much of a
conclusion, that we're not
able to very cleanly
see how we should divide
the orange points from the
blue points, and you don't
see a very clean separation there.
So it seems like we don't have enough
sophistication inside of our network
to be able to model something
that is that complex.
We need a better model
for this neural network.
And I'll do that by
adding a hidden layer.
So now I have the hidden layer
that has two neurons inside of it.
So I have two inputs that then go to
two neurons inside of a hidden layer
that then go to our output, and now I'll
press Play, and what you'll notice here
is that we're able to
do slightly better.
We're able to now say, all right,
these points are definitely blue.
These points are definitely orange.
We're still struggling a little bit
with these points up here though,
and what we can do is
we can see for each
of these hidden neurons
what is it exactly
that these hidden neurons are doing.
Each hidden neuron is learning
its own decision boundary,
and we can see what that boundary is.
This first neuron is
learning, all right,
this line that seems to
separate some of the blue points
from the rest of the points.
This other hidden neuron
is learning another line
that seems to be separating
the orange points in the lower
right from the rest of the points.
So that's why we're able
to sort of figure out
these two areas in the bottom
region, but we're still not
able to perfectly classify
all of the points.
So let's go ahead and
add another neuron--
now we've got three neurons
inside of our hidden layer--
and see what we're able to learn now.
All right.
Well, now we seem to
be doing a better job
by learning three different decision boundaries, with which the three neurons inside of our hidden layer were able to much better figure out how to separate these blue points from the orange points.
And you can see what each of
these hidden neurons is learning.
Each one is learning a slightly
different decision boundary,
and then we're combining those
decision boundaries together
to figure out what the
overall output should be.
And we can try it one more time
by adding a fourth neuron there
and try learning that.
And it seems like now
we can do even better
at trying to separate the blue
points from the orange points,
but we were only able to do
this by adding a hidden layer,
by adding some layer that is
learning some other boundaries,
and combining those boundaries
to determine the output.
And the strength-- the size and thickness of these lines-- indicates how high these weights are, how important each of these inputs is for making this sort of calculation.
And we can do maybe one more simulation.
Let's go ahead and try this on
a data set that looks like this.
Go ahead and get rid
of the hidden layer.
Here now we're trying to
separate the blue points
from the orange points, where all
the blue points are located, again,
inside of a circle, effectively.
So we're not going to
be able to learn a line.
Notice I press Play,
and we're really not
able to draw any sort of
classification at all,
because there is no line
that cleanly separates
the blue points from the orange points.
So let's try to solve this by
introducing a hidden layer.
I'll go ahead and press Play.
And all right.
With two neurons and
a hidden layer, we're
able to do a little better,
because we effectively learned
two different decision boundaries.
We learned this line here,
and we learned this line
on the right-hand side.
And right now, we're just saying,
all right, well, if it's in-between,
we'll call it blue, and if it's
outside, we'll call it orange.
So, not great, but certainly
better than before.
We're learning one decision boundary
and another, and based on those,
we can figure out what
the output should be.
But let's now go ahead and add a
third neuron and see what happens now.
I go ahead and train it.
And now, using three
different decision boundaries
that are learned by each
of these hidden neurons,
we're able to much more
accurately model this distinction
between blue points and orange points.
We're able to figure out, maybe with
these three decision boundaries,
combining them together, you can imagine
figuring out what the output should be
and how to make that
sort of classification.
And so the goal here is just to get a sense for how having more neurons in these hidden layers allows us to learn more structure in the data,
allows us to figure out what the
relevant and important decision
boundaries are.
And then using this
backpropagation algorithm,
we're able to figure out what
the values of these weights
should be in order to
train this network to be
able to classify one category of points away from another category of points.
And this is ultimately what we're
going to be trying to do whenever
we're training a neural network.
So let's go ahead and actually
see an example of this.
You'll recall from last time that
we had this banknotes file that
included information about
counterfeit banknotes as opposed
to authentic banknotes, where it had
four different values for each banknote
and then a categorization of
whether that bank note is considered
to be authentic or a counterfeit note.
And what I wanted to do was,
based on that input information,
figure out some function
that could calculate
based on the input information
what category it belonged to.
And what I've written
here in banknotes.py
is a neural network that will learn just that-- a network that learns,
based on all of the
input, whether or not
we should categorize a banknote
as authentic or as counterfeit.
The first step is the same as
what we saw from last time.
I'm really just reading the data in and
getting it into an appropriate format.
And so this is where more of the writing Python code on your own comes in, in terms of
manipulating this data,
massaging the data
into a format that will
be understood by a
machine learning library
like scikit-learn or like TensorFlow.
And so here I separate it into
a training and a testing set.
And now what I'm doing down below
is I'm creating a neural network.
Here I'm using tf, which
stands for TensorFlow.
Up above I said, import
TensorFlow as tf.
So you have just an abbreviation
that we'll often use,
so we don't need to write out
TensorFlow every time we want
to use anything inside of the library.
I'm using tf.keras.
Keras is an API, a set
of functions that we
can use in order to manipulate
neural networks inside of TensorFlow,
and it turns out there
are other machine learning
libraries that also use the Keras API.
But here, I'm saying, all
right, go ahead and give me
a model that is a sequential model--
a sequential neural network--
meaning one layer after another.
And now I'm going to add to that
model what layers I want inside
of my neural network.
So here I'm saying, model.add.
Go ahead and add a dense layer--
and when we say a dense layer, we mean a layer where each of the nodes inside of the layer is going to be connected to each node from the previous layer, so we have a densely connected layer.
This layer is going to have
eight units inside of it.
So it's going to be a hidden layer
inside of a neural network with eight
different units, eight
artificial neurons, each of which
might learn something different.
And I just sort of
chose eight arbitrarily.
You could choose a different number
of hidden nodes inside of the layer.
And as we saw before, depending
on the number of units
there are inside of your hidden layer, more units
means you can learn
more complex functions,
so maybe you can more accurately
model the training data,
but it comes at a cost.
More units means more weights that
you need to figure out how to update,
so it might be more expensive
to do that calculation.
And you also run the risk of overfitting
on the data if you have too many units,
and you learn to just
overfit on the training data.
That's not good either.
So there is a balance, and
there's often a testing process,
where you'll train on some data
and maybe validate how well you're
doing on a separate set of data--
often called a validation set-- to see,
all right, which setting of parameters,
how many layers should
I have, how many units
should be in each layer,
which one of those
performs the best on the validation set?
So you can do some testing to figure out
what these so-called hyperparameters should be equal to.
Next I specify what the input_shape is,
meaning what does my input look like?
My input has four values,
and so the input shape
is just 4, because we have four inputs.
And then I specify what
the activation function is.
And the activation function,
again, we can choose.
There are a number of different
activation functions.
Here I'm using relu, which
you might recall from earlier.
And then I'll add an output layer.
So I have my hidden layer.
Now I'm adding one more
layer that will just
have one unit, because all I
want to do is predict something
like counterfeit bill or authentic bill.
So I just need a single unit.
And the activation function
I'm going to use here
is that sigmoid
activation function, which
again was that S-shaped curve that
just gave us like a probability of,
what is the probability that this
is a counterfeit bill as opposed
to an authentic bill?
So that then is the structure of my
neural network-- sequential neural
network that has one hidden layer
with eight units inside of it,
and then one output layer that just
has a single unit inside of it.
And I can choose how
many units there are.
I can choose the activation function.
Then I'm going to compile this model.
TensorFlow gives you a choice of how
you would like to optimize the weights--
there are various different
algorithms for doing that--
what type of loss function
you want to use-- again,
many different options for doing that--
and then how I want
to evaluate my model.
Well, I care about accuracy.
I care about how many of my points
am I able to classify correctly
versus not correctly of
counterfeit or not counterfeit,
and I would like it to report to me
how accurate my model is performing.
Then, now that I've
defined that model, I
call model.fit to say, go
ahead and train the model.
Train it on all the training data,
plus all of the training labels--
so labels for each of those
pieces of training data--
and I'm saying run it for
20 epochs, meaning go ahead
and go through each of these
training points 20 times effectively,
go through the data 20 times and
keep trying to update the weights.
If I did it for more epochs, I
could train for even longer
and maybe get a more
accurate result. But then
after I fit it on all of the data,
I'll go ahead and just test it.
I'll evaluate my model
using model.evaluate,
built into TensorFlow, that
is just going to tell me,
how well do I perform
on the testing data?
So ultimately, this is
just going to give me
some numbers that tell me how well
we did in this particular case.
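To make that concrete, here is a minimal sketch of the kind of model definition just described. It may not match banknotes.py exactly: the stand-in random data, the variable names, and the choice of the "adam" optimizer and binary cross-entropy loss are my assumptions, but the layer structure is the one described above.

import numpy as np
import tensorflow as tf

# Stand-in data so the sketch runs on its own: four features per bill and a
# label of 0 (authentic) or 1 (counterfeit). The real program would load these
# from the banknotes dataset instead.
X_training = np.random.rand(1000, 4)
y_training = np.random.randint(0, 2, size=(1000,))
X_testing = np.random.rand(200, 4)
y_testing = np.random.randint(0, 2, size=(200,))

# Sequential model: one layer after another
model = tf.keras.models.Sequential()

# Hidden layer: 8 densely connected units, 4 inputs, ReLU activation
model.add(tf.keras.layers.Dense(8, input_shape=(4,), activation="relu"))

# Output layer: a single unit with a sigmoid activation, giving the
# probability that the bill is counterfeit
model.add(tf.keras.layers.Dense(1, activation="sigmoid"))

# Choose how to optimize the weights, which loss function to use, and ask for
# accuracy to be reported (the specific optimizer and loss here are assumptions)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Train for 20 epochs, then evaluate on the testing data
model.fit(X_training, y_training, epochs=20)
model.evaluate(X_testing, y_testing, verbose=2)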
So now what I'm going to
do is go into banknotes
and go ahead and run banknotes.py.
And what's going to
happen now is it's going
to read in all of that training data.
It's going to generate a neural
network with all my inputs,
my eight hidden layers, or eight
hidden units inside my layer,
and then an output unit, and now
what it's doing is it's training.
It's training 20 times,
and each time, you
can see how my accuracy is
increasing on my training data.
It starts off, the very first
time, not very accurate,
though better than random,
something like 79% of the time,
it's able to accurately
classify one bill from another.
But as I keep training, notice this
accuracy value improves and improves
and improves, until after I've
trained through all of the data points
20 times, it looks like my accuracy
is above 99% on the training data.
And here's where I tested it on
a whole bunch of testing data.
And it looks like in this case,
I was also like 99.8% accurate.
So just using that, I was able
to generate a neural network that
can detect counterfeit
bills from authentic bills
based on this input data
99.8% of the time, at least
based on this particular testing data.
And I might want to
test it with more data
as well, just to be
confident about that.
But this is really the value of
using a machine learning library
like TensorFlow, and there are
others available for Python
and other languages as
well, but all I have to do
is define the structure of the
network and define the data
that I'm going to pass
into the network, and then
TensorFlow runs the
backpropagation algorithm
for learning what all of
those weights should be,
for figuring out how to train
this neural network to be able to,
as accurately as possible, figure
out what the output values should
be there as well.
And so this then was a look at what it
is that neural networks can do, just
using these sequences of
layer after layer after layer,
and you can begin to imagine applying
these to much more general problems.
And one big problem in computing, and
artificial intelligence more generally,
is the problem of computer vision.
Computer vision is all
about computational methods
for analyzing and understanding
images, that you might have pictures
that you want the computer to
figure out how to deal with,
how to process those images,
and figure out how to produce
some sort of useful result out of this.
You've seen this in the context
of social media websites
that are able to look at a photo
that contains a whole bunch of faces,
and it's able to figure out
what's a picture of whom
and label those and tag them
with appropriate people.
This is becoming
increasingly relevant as we
begin to discuss self-driving cars.
These cars now have
cameras, and we would
like for the computer to have
some sort of algorithm that
looks at the images
and figures out, what
color is the light, what cars are around
us and in what direction, for example.
And so computer vision is
all about taking an image
and figuring out what
sort of computation--
what sort of calculation--
we can do with that image.
It's also relevant in the context of
something like handwriting recognition.
This, what you're looking at, is
an example of the MNIST dataset--
it's a big dataset just
of handwritten digits--
that we could use to, ideally,
try and figure out how to predict,
given someone's handwriting, given a
photo of a digit that they have drawn,
can you predict whether it's a 0, 1,
2, 3, 4, 5, 6, 7, 8, or 9, for example.
So this sort of handwriting
recognition is yet another task
that we might want to apply computer
vision tools and techniques towards.
This might be a task
that we might care about.
So how then can we use
neural networks to be
able to solve a problem like this?
Well, neural networks rely
upon some sort of input,
where that input is just numerical data.
We have a whole bunch of
units, where each one of them
just represents some sort of number.
And so in the context of something
like handwriting recognition,
or in the context of
just an image, you might
imagine that an image is really just
a grid of pixels, a grid of dots,
where each dot has some sort
of color, and in the context
of something like
handwriting recognition,
you might imagine that
if you just fill in each
of these dots in a particular
way, you can generate a 2 or an 8,
for example, based on which dots happen
to be shaded in and which dots are not.
And we can represent each of these
pixel values just using numbers.
So for a particular pixel, for example,
0 might represent entirely black.
Depending on how you're
representing color,
it's common to represent
color values on a 0-to-255 range,
so that you can represent a color using
eight bits for a particular value,
like how much white is in the image?
So 0 might represent all black,
255 might represent entirely white
as a pixel, and somewhere in between
might represent some shade of gray,
for example.
But you might imagine not just having a
single slider that determines how much
white is in the image, but
if you had a color image,
you might imagine three different
numerical values-- a red, green,
and blue value--
where the red value controls
how much red is in the image,
we have one value for controlling
how much green is in the pixel,
and one value for how much
blue is in the pixel as well.
And depending on how it is that you set
these values of red, green, and blue,
you can get a different color.
And so any pixel can really
be represented in this case
by three numerical values--
a red value, a green
value, and a blue value.
And if you take a whole
bunch of these pixels,
assemble them together inside
of a grid of pixels, then
you really just have a whole
bunch of numerical values
that you can use in order to perform
some sort of prediction task.
And so what you might imagine
doing is using the same techniques
we talked about before.
Just design a neural network with a lot
of inputs, where for each of the pixels
we might have one input-- or three
different inputs in the case
of a color image-- that is just
connected to a deep neural network,
for example.
And this deep neural
network might take all
of the pixels inside of the image
of what digit a person drew,
and the output might be like 10
neurons that classify it as a 0 or a 1
or 2 or 3, or just tell us in some
way what that digit happens to be.
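As a rough sketch of what that naive approach might look like in Keras-- the layer sizes and the softmax output here are my own choices for illustration, not something specified above--

import tensorflow as tf

model = tf.keras.models.Sequential([
    # one input per pixel: flatten the 28-by-28 grid into 784 values
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    # a hidden layer (128 units chosen arbitrarily for illustration)
    tf.keras.layers.Dense(128, activation="relu"),
    # 10 outputs, one per digit 0 through 9
    tf.keras.layers.Dense(10, activation="softmax")
])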
Now there are a couple of
drawbacks to this approach.
The first drawback to the approach
is just the size of this input array,
that we have a whole bunch of inputs.
If we have a big image, that
is a lot of different channels
we're looking at-- a lot of inputs,
and therefore, a lot of weights
that we have to calculate.
And a second problem is the fact
that by flattening everything
into just the structure
of all the pixels,
we've lost access to a lot of the
information about the structure
of the image that's
relevant, that really,
when a person looks at
an image, they're looking
at particular features of that image.
They're looking at curves.
They're looking at shapes.
They're looking at what
things can you identify
in different regions of the image,
and maybe put those things together
in order to get a better picture of
what the overall image was about.
And by just turning it into pixel
values for each of the pixels,
sure, you might be able
to learn that structure,
but it might be challenging
to do so.
It might be helpful to take advantage
of the fact that you can use properties
of the image itself-- the fact that
it's structured in a particular way--
to be able to improve the way that
we learn based on that image too.
So in order to figure out how we can
train our neural networks to better
be able to deal with images, we'll
introduce a couple of ideas--
a couple of algorithms-- that we can
apply that allow us to take the images
and extract some useful
information out of that image.
And the first idea we'll introduce
is the notion of image convolution.
And what an image convolution is all
about is it's about filtering an image,
sort of extracting useful or
relevant features out of the image.
And the way we do that is by applying
a particular filter that basically combines
the value of every pixel with the
values of all of its neighboring pixels,
according to some sort of kernel
matrix, which we'll see in a moment,
that is going to allow us to weight these
pixels in various different ways.
And the goal of image
convolution then is
to extract some sort of interesting
or useful features out of an image,
to be able to take a pixel, and
based on its neighboring pixels,
maybe predict some sort of
valuable information, something
like taking a pixel and looking
at its neighboring pixels,
you might be able to
predict whether or not
there's some sort of
curve inside the image,
or whether it's forming the outline
of a particular line or a shape,
for example, and that
might be useful if you're
trying to use all of these
various different features
to combine them to say something
meaningful about an image as a whole.
So how then does image convolution work?
Well, we start with a kernel
matrix, and the kernel matrix
looks something like this.
And the idea of this
is that given a pixel--
that would be the middle pixel--
we're going to multiply each of the
neighboring pixels by these values
in order to get some sort of result by
summing up all of the numbers together.
So if I take this kernel, which
you can think of as a filter
that I'm going to apply to the image.
And let's say that I take this image.
This is a four-by-four image.
We'll think of it as just a black and
white image, where each one is just
a single pixel value, so somewhere
between 0 and 255, for example.
So we have a whole bunch of
individual pixel values like this,
and what I'd like to do
is apply this kernel--
this filter, so to speak--
to this image.
And the way I'll do that is, all
right, the kernel is three-by-three.
So you can imagine a five-by-five
kernel or a larger kernel too.
And I'll take it and just first apply
it to the first three-by-three section
of the image.
And what I'll do is I'll take
each of these pixel values
and multiply it by its corresponding
value in the filter matrix
and add all of the results together.
So here, for example, I'll say 10 times
0, plus 20 times negative 1, plus 30
times 0, so on and so forth,
doing all of this calculation.
And at the end, if I
take all these values,
multiply them by their
corresponding value in the kernel,
add the results together, for this
particular set of nine pixels,
I get the value of 10 for example.
And then what I'll do is I'll slide this
three-by-three grid effectively over.
Slide the kernel by one to look at
the next three-by-three section.
And here I'm just sliding
it over by one pixel,
but you might imagine a
different slide length,
or maybe I jump by multiple pixels
at a time if you really wanted to.
You have different options here.
But here I'm just sliding over, looking
at the next three-by-three section.
And I'll do the same math: 20 times 0,
plus 30 times negative 1, plus 40
times 0, plus 20 times negative 1,
so on and so forth, plus 30 times 5.
And what I end up
getting is the number 20.
Then you can imagine shifting over
to this one, doing the same thing,
calculating like the
number 40, for example,
and then doing the same thing here
and calculating a value there as well.
And so what we have now is
what we'll call a feature map.
We have taken this
kernel, applied it to each
of these various different
regions, and what we get
is some representation of a
filtered version of that image.
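Here is a small sketch of that sliding-window computation in plain Python. The image and kernel values are illustrative guesses on my part, chosen only so that they reproduce the numbers worked through above (10, 20, 40).

# Illustrative 4x4 image and 3x3 kernel (values are my reconstruction, not
# necessarily the exact slide shown in lecture)
image = [
    [10, 20, 30, 40],
    [10, 20, 30, 40],
    [20, 30, 40, 50],
    [20, 30, 40, 50],
]
kernel = [
    [ 0, -1,  0],
    [-1,  5, -1],
    [ 0, -1,  0],
]

feature_map = []
for i in range(len(image) - 2):            # slide the kernel down one pixel at a time
    row = []
    for j in range(len(image[0]) - 2):     # slide the kernel across one pixel at a time
        total = 0
        for di in range(3):
            for dj in range(3):
                # multiply each pixel by its corresponding kernel value
                total += image[i + di][j + dj] * kernel[di][dj]
        row.append(total)
    feature_map.append(row)

print(feature_map)    # [[10, 20], [40, 50]]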
And so to give a more concrete example
of why it is that this kind of thing
could be useful, let's
take this kernel matrix,
for example, which is quite a famous
one, that has an 8 in the middle
and then all of the neighboring
pixels that get a negative 1.
And let's imagine we
wanted to apply that
to a three-by-three part of
an image that looks like this,
where all the values are the same.
They're all 20, for instance.
Well, in this case, if you do 20
times 8, and then subtract 20,
subtract 20, subtract 20, for
each of the eight neighbors,
well, the result of that is
you just get that expression,
which comes out to be 0.
You multiply 20 by 8, but
then you subtract 20 eight times,
according to that particular kernel.
The result of all of that is just 0.
So the takeaway here is that when a
lot of the pixels are the same value,
we end up getting a value close to 0.
If, though, we had something like
this, 20s along this first row,
then 50s in the second row,
and 50s in the third row, well,
then when you do this
same kind of math--
20 times negative 1, 20 times
negative 1, so on and so forth--
then I get a higher value-- a value
like 90, in this particular case.
And so the more general
idea here is that
by applying this kernel,
negative 1s, 8 in the middle,
and then negative 1s, what I get
is when this middle value is very
different from the neighboring values--
like 50 is greater than these 20s--
then you'll end up with
a value higher than 0.
Like if this number is
higher than its neighbors,
you end up getting a bigger output,
but if this value is the same as all
of its neighbors, then you get a
lower output, something like 0.
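Here is a quick check of those two cases in code, using the kernel with 8 in the middle and negative 1 for every neighbor.

def apply_kernel(patch, kernel):
    # multiply each value in the 3x3 patch by the corresponding kernel value
    # and add up all of the results
    return sum(patch[i][j] * kernel[i][j] for i in range(3) for j in range(3))

edge_kernel = [[-1, -1, -1],
               [-1,  8, -1],
               [-1, -1, -1]]

uniform_patch = [[20, 20, 20],
                 [20, 20, 20],
                 [20, 20, 20]]

boundary_patch = [[20, 20, 20],
                  [50, 50, 50],
                  [50, 50, 50]]

print(apply_kernel(uniform_patch, edge_kernel))    # 0: all the values are the same
print(apply_kernel(boundary_patch, edge_kernel))   # 90: the middle differs from its neighbors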
And it turns out that
this sort of filter
can therefore be used in something
like detecting edges in an image,
where we want to detect the boundaries
between various different objects
inside of an image.
I might use a filter like
this, which is able to tell
whether the value of this pixel
is different from the values
of the neighboring pixels-- if it's
greater than the values of the pixels
that happen to surround it.
And so we can use this in
terms of image filtering.
And so I'll show you an example of that.
I have here, in filter.py, a file that
uses the Python Imaging Library, or PIL,
to do some image filtering.
I go ahead and open an image.
And then all I'm going to do is
apply a kernel to that image.
It's going to be a three-by-three
kernel, the same kind of kernel
we saw before.
And here is the kernel.
This is just a list
representation of the same matrix
that I showed you a
moment ago, with its
negative 1, negative 1, negative 1.
The second row is
negative 1, 8, negative 1.
The third row is all negative 1s.
And then at the end, I'm going to go
ahead and show the filtered image.
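Something along these lines is a reasonable sketch of what filter.py might look like; the exact details may differ, but PIL's ImageFilter.Kernel is a standard way to apply a kernel like this one.

import sys
from PIL import Image, ImageFilter

# Open the image named on the command line
image = Image.open(sys.argv[1]).convert("RGB")

# Apply the 3x3 edge-detection kernel: negative 1s everywhere, 8 in the middle
filtered = image.filter(ImageFilter.Kernel(
    size=(3, 3),
    kernel=[-1, -1, -1,
            -1,  8, -1,
            -1, -1, -1],
    scale=1
))

# Show the filtered image
filtered.show()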
So if, for example, I go
into the convolution directory
and I open up an image
like bridge.png, this
is what an input image might look like,
just an image of a bridge over a river.
Now I'm going to go ahead and run
this filter program on the bridge.
And what I get is this image here.
Just by taking the original
image and applying that filter
to each three-by-three
grid, I've extracted
all of the boundaries, all of the
edges inside the image that separate
one part of the image from another.
So here I've got a
representation of boundaries
between particular parts of the image.
And you might imagine that if
a machine learning algorithm is
trying to learn like what an
image is of, a filter like this
could be pretty useful.
Maybe the machine learning
algorithm doesn't care about all
of the details of the image.
It just cares about
certain useful features.
It cares about particular
shapes that are
able to help it determine
that based on the image,
this is going to be a
bridge, for example.
And so this type of idea
of image convolution
can allow us to apply
filters to images that
allow us to extract useful results
out of those images-- taking an image
and extracting its edges, for example.
You might imagine many
other filters that
could be applied to an image that are
able to extract particular values as
well.
And a filter might have separate kernels
for the red values, the green values,
and the blue values that are
all summed together at the end,
such that you could have
particular filters looking for,
is there red in this part of the image?
Is there green in other
parts of the image?
You can begin to assemble these
relevant and useful filters that are
able to do these calculations as well.
So that then was the idea of
image convolution-- applying
some sort of filter to an
image to be able to extract
some useful features out of that image.
But all the while, these
images are still pretty big.
There are a lot of pixels
involved in the image.
And realistically speaking, if
you've got a really big image,
that poses a couple of problems.
One, it means a lot of input
going into the neural network,
but two, it also means
that we really have
to care about what's in each particular
pixel, whereas realistically,
if you're looking at an
image, you often don't care
whether something is
in one particular pixel
versus the pixel immediately
to the right of it.
They're pretty close together.
You really just care
about whether there is
a particular feature in
some region of the image,
and maybe you don't care about
exactly which pixel it happens to be.
And so there's a technique
we can use known as pooling.
And what pooling is, is it means
reducing the size of an input
by sampling from regions
inside of the input.
So we're going to take a big image
and turn it into a smaller image
by using pooling.
And in particular, one of the
most popular types of pooling
is called max-pooling.
And what max-pooling does is it pools
just by choosing the maximum value
in a particular region.
So, for example, let's imagine
I had this four-by-four image,
but I wanted to reduce its dimensions.
I wanted to make it a smaller image, so
that I have fewer inputs to work with.
Well, what I could do is I
could apply a two-by-two max
pool, where the idea
would be that I'm going
to first look at this
two-by-two region and say, what
is the maximum value in that region?
Well, it's the number 50.
So we'll go ahead and
just use the number 50.
And then we'll look at
this two-by-two region.
What is the maximum value here?
110.
So that's going to be my value.
Likewise here, the maximum
value looks like 20.
Go ahead and put that there.
Then for this last
region, the maximum value
was 40, so we'll go ahead and use that.
And what I have now is
a smaller representation
of this same original
image that I obtained just
by picking the maximum value
from each of these regions.
So again, the advantages
here are now I only
have to deal with a two-by-two
input instead of a four-by-four,
and you can imagine shrinking
the size of an image even more.
But in addition to that,
I'm now able to make
my analysis independent of
whether a particular value was
in this pixel or this pixel.
I don't care if the 50 was here or here.
As long as it was
generally in this region,
I'll still get access to that value.
So it makes our algorithms a
little bit more robust as well.
So that then is pooling--
taking the size of the
image and reducing it
a little bit by just sampling from
particular regions inside of the image.
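Here is a short sketch of that two-by-two max-pooling step in plain Python. The pixel values are made up, chosen only so that the maxima of the four regions come out to the 50, 110, 20, and 40 from the example.

image = [
    [10,  50,  90, 110],
    [20,  30,  60,  70],
    [10,  20,  40,  30],
    [15,  10,  20,  25],
]

pooled = []
for i in range(0, 4, 2):
    row = []
    for j in range(0, 4, 2):
        # take the maximum value from each 2x2 region of the image
        region = [image[i][j],     image[i][j + 1],
                  image[i + 1][j], image[i + 1][j + 1]]
        row.append(max(region))
    pooled.append(row)

print(pooled)    # [[50, 110], [20, 40]]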
And now we can put all of these ideas
together-- pooling, image convolution,
neural networks-- all together into
another type of neural network called
a convolutional neural network, or a
CNN, which is a neural network that
uses this convolution step, usually
in the context of analyzing an image,
for example.
And so the way that a convolutional
neural network works is that we
start with some sort of input
image-- some grid of pixels--
but rather than immediately put
that into the neural network layers
that we've seen before, we'll start
by applying a convolution step, where
the convolution step involves applying
a number of different image filters
to our original image
in order to get what
we call a feature map, the result
of applying some filter to an image.
And we could do this once,
but in general, we'll
do this multiple times getting a
whole bunch of different feature
maps, each of which might extract
some different relevant feature out
of the image, some different
important characteristic of the image
that we might care about using in order
to calculate what the result should be.
And in the same way that
we can train neural networks to learn
the weights between particular units
inside of the neural network,
we can also train neural networks to
learn what those filters should be--
what the values of the
filters should be--
in order to get the most useful,
most relevant information out
of the original image just by figuring
out what setting of those filter
values-- the values
inside of that kernel--
results in minimizing the loss
function and minimizing how poorly
our hypothesis actually performs
in figuring out the classification
of a particular image, for example.
So we first apply this convolution step.
Get a whole bunch of these
various different feature maps.
But these feature maps are quite large.
There are a lot of pixel
values that happen to be here.
And so a logical next step
to take is a pooling step,
where we reduce the size of these
images by using max-pooling,
for example, extracting the maximum
value from any particular region.
There are other pooling
methods that exist
as well, depending on the situation.
You could use something
like average-pooling,
where instead of taking the
maximum value from a region,
you take the average value from a
region, which has its uses as well.
But in effect, what pooling will do
is it will take these feature maps
and reduce their dimensions,
so that we end up
with smaller grids with fewer pixels.
And this then is going to be
easier for us to deal with.
It's going to mean fewer inputs
that we have to worry about,
and it's also going to mean we're
more resilient, more robust,
against potential movements of
particular values just by one pixel,
when ultimately, we really don't care
about those one pixel differences that
might arise in the original image.
Now after we've done this pooling step,
now we have a whole bunch of values
that we can then
flatten out and just put
into a more traditional neural network.
So we go ahead and
flatten it, and then we
end up with a traditional
neural network that
has one input for each of these values
in each of these resulting feature
maps after we do the convolution
and after we do the pooling step.
And so this then is the general
structure of a convolutional network.
We begin with the image,
apply convolution,
apply pooling, flatten the
results, and then put that
into a more traditional neural network
that might itself have hidden layers.
You can have deep
convolutional networks that
have hidden layers in between this
flattened layer and the eventual output
to be able to calculate various
different features of those values.
But this then can help us to be
able to use convolution and pooling,
to use our knowledge about
the structure of an image,
to be able to get better results, to
be able to train our networks faster
in order to better capture
particular parts of the image.
And there's no reason necessarily why
you can only use these steps once.
In fact, in practice, you'll often use
convolution and pooling multiple times
in multiple different steps.
So what you might imagine doing
is starting with an image,
first applying convolution
to get a whole bunch of maps,
then applying pooling, then
applying convolution again,
because these maps are still pretty big.
You can apply convolution to try
and extract relevant features
out of this result.
Then take those results,
apply pooling in order to reduce
their dimensions, and then take that
and feed it into a neural network
that maybe has fewer inputs.
So here, I have two different
convolution and pooling steps.
I do convolution and
pooling once, and then I
do convolution and pooling a
second time, each time extracting
useful features from the layer
before it, each time using
pooling to reduce the dimensions of
what you're ultimately looking at.
And the goal now of this sort of
model is that in each of these steps,
you can begin to learn
different types of features
of the original image, that
maybe in the first step
you learn very low-level features, just
learn and look for features like edges
and curves and shapes, because based
on pixels in their neighboring values,
you can figure out, all
right, what are the edges?
What are the curves?
What are the various different
shapes that might be present there?
But then once you have a
mapping that just represents
where the edges and curves
and shapes happen to be,
you can imagine applying
the same sort of process
again to begin to look for higher-level
features-- look for objects,
maybe look for people's
eyes in facial recognition,
for example, maybe look at more
complex shapes like the curves
on a particular number if you're trying
to recognize a digit in a handwriting
recognition sort of scenario.
And then after all of
that, now that you have
these results that represent
these higher-level features,
you can pass them into a
neural network, which is really
just a deep neural network that looks
like this, where you might imagine
making a binary classification, or
classifying into multiple categories,
or performing various different
tasks on this sort of model.
So convolutional neural networks can
be quite powerful and quite popular
when it comes to trying
to analyze images.
We don't strictly need them.
We could have just used a vanilla neural
network that just operates with layer
after layer as we've seen before.
But these convolutional
neural networks can
be quite helpful, in particular,
because of the way they
model the way a human
might look at an image,
that instead of a human
looking at every single pixel
simultaneously and trying to involve all
of them by multiplying them together,
you might imagine that
what convolution is really
doing is looking at various
different regions of the image
and extracting relevant
information and features out
of those parts of the image
the same way that a human might
have visual receptors that are looking
at particular parts of what they see,
and using those, combining
them, to figure out
what meaning they can draw from all
of those various different inputs.
And so you might imagine applying
this to a situation like handwriting
recognition.
So we'll go ahead and see
an example of that now.
I'll go ahead and open
up handwriting.py.
Again, what we do here is
we first import TensorFlow.
And then, TensorFlow, it
turns out, has a few datasets
that are built in--
built into the library
that you can just immediately access.
And one of the most famous
datasets in machine learning
is the MNIST dataset, which is just
a dataset of a whole bunch of samples
of people's handwritten digits.
I showed you a slide of
that a little while ago.
And what we can do is just
immediately access that dataset,
which is built into the library, so that
if I want to do something like train
on a whole bunch of digits, I can just
use the dataset that is provided to me.
Of course, if I had my own
dataset of handwritten images,
I can apply the same idea.
I'd first just need to take those images
and turn them into an array of pixels,
because that's the way that
these are going to be formatted.
They're going to be
formatted as, effectively,
an array of individual pixels.
And now there's a bit of
reshaping I need to do,
just turning the data into
a format that I can put
into my convolutional neural network.
So this is doing things like taking all
the values and dividing them by 255.
If you remember, these color
values tend to range from 0 to 255.
So I can divide them by 255, just
to put them into a 0-to-1 range,
which might be a little
bit easier to train on.
And then doing various other
modifications to the data, just
to get it into a nice usable format.
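A sketch of that preprocessing, assuming the same sort of steps just described-- the exact code in handwriting.py may be organized differently-- might look like this.

import tensorflow as tf

# Load the MNIST handwriting dataset that ships with TensorFlow
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Divide pixel values by 255 to put them into a 0-to-1 range
x_train, x_test = x_train / 255.0, x_test / 255.0

# Turn each label (0 through 9) into a vector of 10 category values
y_train = tf.keras.utils.to_categorical(y_train)
y_test = tf.keras.utils.to_categorical(y_test)

# Reshape each image into a 28-by-28 grid with a single channel value
x_train = x_train.reshape(-1, 28, 28, 1)
x_test = x_test.reshape(-1, 28, 28, 1)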
But here's the interesting
and important part.
Here is where I create the
convolutional neural network-- the CNN--
where here I'm saying, go ahead
and use a sequential model.
And whereas before I could use model.add to say
add a layer, add a layer, add a layer,
another way I could define
it is just by passing
as input to the sequential neural
network a list of all of the layers
that I want.
And so here, the very
first layer in my model
is a convolutional
layer, where I'm first
going to apply convolution to my image.
I'm going to use 32 different filters,
so my model is going
to learn 32 different
filters that I would
like to learn on the input
image, where each filter is
going to be a three-by-three kernel.
So we saw those
three-by-three kernels before,
where we could multiply each value
in a three-by-three grid by its
corresponding kernel value and add all
the results together.
So here I'm going to learn 32 different
of these three-by-three filters.
I can again specify my
activation function.
And I specify what my input shape is.
My input shape in the
banknotes case was just 4.
I had four inputs.
My input shape here is going to be 28,
comma, 28, comma, 1, because it turns out
that's how the MNIST dataset
organizes its data: each of these
handwritten digits is a 28-by-28 pixel
grid, and each one of those images only
has one channel value.
These handwritten digits
are just black and white,
so it's just a single
color value representing
how much black or how much white.
You might imagine that in a color image,
if you were doing this sort of thing,
you might have three
different channels-- a red,
a green, and a blue
channel, for example.
But in the case of just handwriting
recognition and recognizing a digit,
we're just going to use a single value
for shaded-in or not shaded-in,
and it might range, but it's
just a single color value.
And that then is the very first
layer of our neural network,
a convolutional layer
that will take the input
and learn a whole bunch
of different filters
that we can apply to the input
to extract meaningful features.
The next step is going to be a
max-pooling layer, also built
right into TensorFlow,
where this is going
to be a layer that is going to
use a pool size of two by two,
meaning we're going to look at
two-by-two regions inside of the image,
and just extract the maximum value.
Again, we've seen why
this can be helpful.
It'll help to reduce
the size of our input.
Once we've done that, we'll go ahead
and flatten all of the units just
into a single layer
that we can then pass
into the rest of the neural network.
And now, here's the rest
of the whole network.
Here, I'm saying, let's add a hidden
layer to my neural network with 128
units-- so a whole bunch of hidden
units inside of the hidden layer--
and just to prevent overfitting,
I can add a dropout to that-- say,
you know what?
When you're training, randomly drop
out half from this hidden layer,
just to make sure we don't become
over-reliant on any particular node.
We begin to really generalize and
stop ourselves from overfitting.
So TensorFlow allows us,
just by adding a single line,
to add dropout into our model as
well, such that when it's training,
it will perform this
dropout step in order
to help make sure that we don't
overfit on this particular data.
And then finally, I add an output layer.
The output layer is going
to have 10 units, one
for each category, that I would
like to classify digits into,
so 0 through 9, 10 different categories.
And the activation function
I'm going to use here
is called the softmax
activation function.
And in short, what the softmax
activation function is going to do
is it's going to take
the output and turn it
into a probability distribution.
So ultimately, it's
going to tell me, what
did we estimate the probability is
that this is a 2 versus a 3 versus a 4,
and so it will turn it into that
probability distribution for me.
Next up, I'll go ahead
and compile my model
and fit it on all of my training data.
And then I can evaluate how well
the neural network performs.
And then I've added
to my Python program,
if I've provided a command line
argument, like the name of a file,
I'm going to go ahead and
save the model to a file.
And so this can be quite useful too.
Once you've done the
training step, which
could take some time--
going through the data, running
backpropagation with gradient descent,
to be able to say, all
right, how should we adjust
the weights of this particular model--
you end up calculating
values for these weights,
calculating values for
these filters, and you'd
like to remember that information,
so you can use it later.
And so TensorFlow allows us to
just save a model to a file,
such that later if we want to
use the model we've learned,
use the weights that we've learned,
to make some sort of new prediction
we can just use the model
that already exists.
So what we're doing here is after
we've done all the calculation,
we go ahead and save the
model to a file, such
that we can use it a little bit later.
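Putting the pieces just described together, a sketch of that convolutional network-- picking up from the x_train and y_train arrays prepared earlier, and with the optimizer and loss function as my assumptions, since all we said is that a choice is made-- might look like this.

import sys
import tensorflow as tf

model = tf.keras.models.Sequential([

    # Convolutional layer: learn 32 filters, each a 3x3 kernel
    tf.keras.layers.Conv2D(
        32, (3, 3), activation="relu", input_shape=(28, 28, 1)
    ),

    # Max-pooling layer, using a 2x2 pool size
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),

    # Flatten all of the units into a single layer
    tf.keras.layers.Flatten(),

    # Hidden layer with 128 units, with dropout to help prevent overfitting
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),

    # Output layer: one unit per digit, softmax for a probability distribution
    tf.keras.layers.Dense(10, activation="softmax")
])

# Compile, train for 10 epochs, and evaluate on the testing data
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=10)
model.evaluate(x_test, y_test, verbose=2)

# If a filename was given on the command line, save the trained model to it
if len(sys.argv) == 2:
    model.save(sys.argv[1])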
So for example, if I go into digits,
I'm going to run handwriting.py.
I won't save it this time.
We'll just run it and go
ahead and see what happens.
What will happen is we need
to go through the model
in order to train on all of these
samples of handwritten digits.
So the MNIST dataset gives
us thousands and thousands
of sample handwritten
digits in the same format
that we can use in order to train.
And so now what you're seeing
is this training process,
and unlike the banknotes
case, where there were far
fewer data points--
the data was very, very simple--
here, the data is more complex, and
this training process takes time.
And so this is another
one of those cases that shows why,
when training neural networks,
computational power is
so important, that oftentimes,
you see people wanting
to use sophisticated GPUs in
order to more efficiently be
able to do this sort of
neural network training.
It also speaks to the reason
why more data can be helpful.
The more sample data
points you have, the better
you can begin to do this training.
So here we're going through
60,000 different samples
of handwritten digits.
And I said that we're going
to go through them 10 times.
So we're going to go through the
dataset 10 times, training each time,
hopefully improving upon
our weights with every time
we run through this dataset.
And we can see over here on
the right what the accuracy is
each time we go ahead and run
this model, that the first time,
it looks like we got an accuracy
of about 92% of the digits
correct based on this training set.
We increased that to 96% or 97%.
And every time we run
this, we're going to see,
hopefully, the accuracy improve,
as we continue to try and use
that gradient descent, that process
of trying to run the algorithm
to minimize the loss that we get
in order to more accurately predict
what the output should be.
And what this process is doing is
it's learning not only the weights,
but it's learning the
features to use-- the kernel
matrix to use-- when performing
that convolution step, because this
is a convolutional neural network,
where I'm first performing
those convolutions, and then doing
the more traditional neural network
structure.
This is going to learn all of
those individual steps as well.
So here, we see that TensorFlow provides
me with some very nice output, telling
me about how many seconds are left
with each of these training runs,
which allows me to see
just how well we're doing.
So we'll go ahead and see
how this network performs.
It looks like we've gone
through the dataset seven times.
We're going through an eighth time now.
And at this point, the
accuracy is pretty high.
We saw we went from 92% up to 97%.
Now it looks like 98%.
And at this point, it seems like
things are starting to level out.
There's probably a limit to
how accurate we can ultimately
be without running the
risk of overfitting.
Of course, with enough nodes, you could
just memorize the input and overfit
upon them.
But we'd like to avoid doing that,
and dropout will help us with this.
But now, we see we're almost
done finishing our training step.
We're at 55,000.
All right.
We've finished training,
and now it's going
to go ahead and test for
us on 10,000 samples.
And it looks like on the testing
set, we were 98.8% accurate.
So we ended up doing
pretty well, it seems,
on this testing set to
see how accurately can
we predict these handwritten digits.
And so what we could do then
is actually test it out.
I've written a program called
recognition.py using PyGame.
If you pass it a model
that's been trained,
and I pre-trained an example model
using this input data, what we can do
is see whether or not
we've been able to train
this convolutional neural network
to be able to predict handwriting,
for example.
So I can try just like
drawing a handwritten digit.
I'll go ahead and draw like
the number 2, for example.
So there's my number 2.
Again, this is messy.
If you tried to imagine how you would
write a program with just ifs
and thens to be able to do
this sort of calculation,
it would be tricky to do so.
But here, I'll press
Classify, and all right.
It seems it was able to correctly
classify that what I drew
was the number 2.
We'll go ahead and reset it.
Try it again.
We'll draw like an 8, for example.
So here is an 8.
I'll press Classify.
And all right.
It predicts that the digit
that I drew was an 8.
And the key here is this
really begins to show
the power of what the neural
network is doing, somehow looking
at various different features
of these different pixels,
figuring out what the
relevant features are,
and figuring out how to combine
them to get a classification.
And this would be a difficult task
to provide explicit instructions
to the computer on how to do, like
to use a whole bunch of if-thens
to process all of these
pixel values to figure out
what the handwritten digit is,
like everyone is going to draw
their 8 a little bit differently.
If I drew the 8 again, it would
look a little bit different.
And yet ideally, we want to
train a network to be robust
enough so that it begins to
learn these patterns on its own.
All I said was, here is the
structure of the network,
and here is the data on
which to train the network,
and the network learning
algorithm just tries
to figure out what is the
optimal set of weights,
what is the optimal
set of filters to use,
in order to be able
to accurately classify
a digit into one category or another.
That's going to show the power of
these convolutional neural networks.
And so that then was a look at how we
can use convolutional neural networks
to begin to solve problems with regards
to computer vision, the ability to take
an image and begin to analyze it.
And so this is the type
of analysis you might
imagine that's happening
in self-driving cars that
are able to figure out what filters to
apply to an image to understand what it
is that the computer is looking
at, or the same type of idea that
might be applied to facial
recognition and social media
to be able to determine how to
recognize faces in an image as well.
You can imagine a neural network
that, instead of classifying
into one of 10 different digits, could
instead classify like, is this person A
or is this person B, trying to
tell those people apart just based
on convolution.
And so now what we'll take a look at
is yet another type of neural network
that can be quite popular
for certain types of tasks.
But to do so, we'll try to generalize
and think about our neural network
a little bit more abstractly, that here
we have a sample deep neural network,
where we have this input layer, a
whole bunch of different hidden layers
that are performing certain
types of calculations,
and then an output layer here that
just generates some sort of output
that we care about calculating.
But we could imagine representing
this a little more simply, like this.
Here is just a more abstract
representation of our neural network.
We have some input.
That might be like a vector of a whole
bunch of different values as our input.
That gets passed into
a network to perform
some sort of calculation or
computation, and that network
produces some sort of output.
That output might be a single value.
It might be a whole bunch
of different values.
But this is the general structure of
the neural network that we've seen.
There is some sort of input
that gets fed into the network,
and using that input, the network
calculates what the output should be.
And this sort of model
for a neural network
is what we might call a
feed-forward neural network.
Feed-forward neural networks have
connections only in one direction;
they move from one layer to
the next layer to the layer
after that, such that the inputs pass
through various different hidden layers
and then ultimately produce
some sort of output.
So feed-forward neural networks
are very helpful for solving
these types of classification
problems that we saw before.
We have a whole bunch of input.
We want to learn what
setting of weights will allow
us to calculate the output effectively.
But there are some limitations
on feed-forward neural networks
that we'll see in a moment.
In particular, the input
needs to be of a fixed shape,
like a fixed number of neurons
are in the input layer,
and there's a fixed
shape for the output,
like a fixed number of
neurons in the output layer,
and that has some
limitations of its own.
And a possible solution to this--
and we'll see examples of
the types of problems we
can solve with this in just a second--
is instead of just a feed-forward
neural network where there are only
connections in one direction,
from left to right effectively,
across the network, we can also
imagine a recurrent neural network,
where a recurrent
neural network generates
output that gets fed back into itself as
input for future runs of that network.
So whereas in a
traditional neural network,
we have inputs that get fed into the
network that get fed into the output,
and the only thing that determines the
output is based on the original input
and based on the calculation we
do inside of the network itself,
this goes in contrast with
a recurrent neural network,
where in a recurrent neural
network, you can imagine output
from the network feeding back
to itself into the network
again as input for the next time
that you do the calculations
inside of the network.
What this allows is it allows the
network to maintain some sort of state,
to store some sort of
information that can
be used on future runs of the network.
Previously, the network
just defined some weights,
and we passed inputs through the
network, and it generated outputs,
but the network wasn't saving any
information based on those inputs
to be able to remember for future
iterations or for future runs.
What a recurrent neural
network will let us do
is let the network
store information that
gets passed back in as input to the
network again the next time we try
and perform some sort of action.
And this is particularly helpful
when dealing with sequences of data.
So we'll see a real-world example
of this right now actually.
Microsoft has developed an
AI known as the CaptionBot,
and what the CaptionBot
does is it says, I
can understand the
content of any photograph,
and I'll try to describe
it as well as any human.
I'll analyze your photo, but
I won't store it or share it.
And so what Microsoft CaptionBot
seems to be claiming to do
is it can take an image and
figure out what's in the image
and just give us a
caption to describe it.
So let's try it out.
Here, for example, is an
image of Harvard Square
and some people walking in front of
one of the buildings at Harvard Square.
I'll go ahead and take
the URL for that image,
and I'll paste it into
CaptionBot, then just press Go.
So CaptionBot is analyzing
the image, and then it says,
I think it's a group of
people walking in front
of a building, which seems amazing.
The AI is able to look at this image
and figure out what's in the image.
And the important
thing to recognize here
is that this is no longer
just a classification task.
We saw being able to classify images
with a convolutional neural network,
where the job was to take the images
and then figure out, is it a 0, or a 1,
or a 2; or is that this person's
face or that person's face?
What seems to be happening
here is the input is an image,
and we know how to get networks
to take input of images,
but the output is text.
It's a sentence.
It's a phrase, like "a group of people
walking in front of a building."
And this would seem to pose a
challenge for our more traditional
feed-forward neural networks,
the reason being
that in traditional
neural networks, we just
have a fixed-size input
and a fixed-size output.
There are a certain number of neurons
in the input to our neural network
and a certain number of
outputs for our neural network,
and then some calculation
that goes on in between.
But the size of the inputs--
the number of values in the input and
the number of values in the output--
those are always going to be fixed based
on the structure of the neural network,
and that makes it difficult to
imagine how a neural network can
take an image like
this and say, you know,
it's a group of people walking
in front of the building,
because the output is text.
It's a sequence of words.
Now it might be possible for a
neural network to output one word.
One word you could represent
as a vector of values,
and you can imagine ways of doing that.
And next time, we'll talk
a little bit more about AI
as it relates to language
and language processing.
But a sequence of words
is much more challenging,
because depending on
the image, you might
imagine the output is a
different number of words.
We could have sequences
of different lengths,
and somehow we still want to be able
to generate the appropriate output.
And so the strategy here is to
use a recurrent neural network,
a neural network that can feed
its own output back into itself
as input for the next time.
And this allows us to do what we call
a one-to-many relationship for inputs
to outputs, that in vanilla, more
traditional neural networks--
these are what we consider to
be one-to-one neural networks--
you pass in one set of values as
input, you get one vector of values
as the output--
but in this case, we want to
pass in one value as input--
the image-- and we want to
get a sequence-- many values--
as output, where each value is like
one of these words that gets produced
by this particular algorithm.
And so the way we might do this
is we might imagine starting
by providing input the image
into our neural network,
and the neural network is
going to generate output,
but the output is not going to
be the whole sequence of words,
because we can't represent
the whole sequence of words
using just a fixed set of neurons.
Instead, the output is just
going to be the first word.
We're going to train
the network to output
what the first word of
the caption should be.
And you could imagine
that Microsoft has trained
this by running a whole bunch
of training samples through the AI,
giving it a whole bunch of pictures
and what the appropriate caption was,
and having the AI begin
to learn from that.
But now, because the
network generates output
that can be fed back
into itself, you can
imagine the output of the network
being fed back into the same network--
this here looks like a separate network,
but it's really the same network that's
just getting different input--
that this network's output
gets fed back into itself,
but it's going to
generate another output,
and that other output is going to be
like the second word in the caption.
And this recurrent neural
network then, this network
is going to generate other
output that can be fed back
into itself to generate
yet another word, fed back
into itself to generate another word.
And so recurrent neural
networks allow us to represent
this sort of one-to-many structure.
You provide one image as
input, and the neural network
can pass data into the
next run of the network,
and then again and again, such that you
could run the network multiple times,
each time generating a different output,
still based on that original input.
And this is where
recurrent neural networks
become particularly useful when dealing
with sequences of inputs or outputs.
My output is a sequence of words,
and since I can't very easily
represent outputting an
entire sequence of words,
I'll instead output that
sequence one word at a time,
by allowing my network
to pass information
about what still needs to
be said about the photo
into the next stage of
running the networks.
So you could run the
network multiple times--
the same network with the same weights--
just getting different input each time,
first getting input from the image,
and then getting input
from the network itself,
as additional information
about what additionally
needs to be given in a
particular caption, for example.
So this then is a
one-to-many relationship
inside of a recurrent neural network.
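To make the mechanics concrete, here is a toy sketch of that one-to-many loop-- nothing like the real CaptionBot model, and with made-up sizes and random weights-- just to show the same network being run repeatedly, with its state and output fed back in each time.

import numpy as np

rng = np.random.default_rng(0)
W_in = rng.normal(size=(8, 16))       # weights applied to each run's input
W_state = rng.normal(size=(16, 16))   # weights applied to the carried state
W_out = rng.normal(size=(16, 8))      # weights producing each run's output

def run_network(x, state):
    # one "run" of the network: combine this run's input with the state
    # carried over from the previous run, then produce one output
    state = np.tanh(x @ W_in + state @ W_state)
    output = state @ W_out
    return output, state

# a single input (imagine an encoding of the image) starts things off...
x = rng.normal(size=(8,))
state = np.zeros(16)

# ...and then each run's output is fed back in as the next run's input,
# producing one piece of the output sequence (one caption "word") per run
sequence = []
for _ in range(5):
    output, state = run_network(x, state)
    sequence.append(output)
    x = output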
But it turns out there
are other models that we
can use-- other ways we can try and
use recurrent neural networks-- to be
able to represent data that might
be stored in other forms as well.
We saw how we could use neural
networks in order to analyze images,
in the context of convolutional
neural networks that take an image,
figure out various different
properties of the image,
and are able to draw some sort
of conclusion based on that.
But you might imagine that
something like YouTube,
they need to be able to do a
lot of learning based on video.
They need to look
through videos to detect
if there are copyright
violations, or they
need to be able to look through
videos to maybe identify
what particular items are inside
of the video, for example.
And video, you might imagine,
is much more difficult
to pass as input to a neural
network, because whereas with an image
you can just treat each pixel as a
different value, videos are sequences.
They're sequences of images, and each
sequence might be a different length,
and so it might be
challenging to represent
that entire video as a
single vector of values
that you could pass in
to a neural network.
And so here too,
recurrent neural networks
can be a valuable solution for
trying to solve this type of problem.
Then instead of just passing in a
single input into our neural network,
we could pass in the input one
frame at a time, you might imagine,
first taking the first frame of the
video, passing it into the network,
and then maybe not having the
network output anything at all yet.
Let it take in another input, and
this time, pass it into the network,
but the network gets
information from the last time
we provided an input into the network.
Then we pass in a third input
and then a fourth input,
where each time, the network
gets the most recent input,
like each frame of
the video, but it also
gets information the network processed
from all of the previous iterations.
So on frame number
four, you end up getting
the input for frame number four,
plus information the network has
calculated from the first three frames.
And using all of that data combined,
this recurrent neural network
can begin to learn how to extract
patterns from a sequence of data
as well.
And so you might imagine
that you want to classify
a video into a number
of different genres,
like an educational video, or a music
video, or different types of videos.
That's a classification
task, where you want
to take as input each of
the frames of the video,
and you want to output
something like what category
it happens to belong to.
And you can imagine doing
this sort of thing--
this sort of many-to-one learning--
anytime your input is a sequence.
And so input is a sequence
in the context of a video.
It could be in the context of like,
if someone has typed a message,
and you want to be able to
categorize that message,
like if you're trying
to take a movie review
and trying to classify it as a
positive review or a negative review.
That input is a sequence
of words, and the output
is a classification--
positive or negative.
There too, a recurrent
neural network might
be helpful for analyzing
sequences of words,
and they're quite popular when it
comes to dealing with language.
It could even be used
for spoken language
as well, that spoken language
is an audio waveform that
can be segmented into distinct
chunks, and each of those
can be passed in as an input
into a recurrent neural network
to be able to classify
someone's voice, for instance,
if you want to do voice recognition,
to say is this one person
or is this another?
Here are also cases where you might
want this many-to-one architecture
for a recurrent neural network.
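As a sketch of what that many-to-one setup might look like in Keras-- with the vocabulary size, sequence handling, and layer sizes as arbitrary choices of mine-- a movie-review classifier could be structured like this.

import tensorflow as tf

model = tf.keras.models.Sequential([
    # turn each word (given as an integer index) into a vector of values
    tf.keras.layers.Embedding(input_dim=10000, output_dim=64),

    # a recurrent (LSTM) layer reads the sequence one word at a time,
    # carrying information forward from the earlier words
    tf.keras.layers.LSTM(64),

    # a single output unit: the probability that the review is positive
    tf.keras.layers.Dense(1, activation="sigmoid")
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])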
And then as one final
problem, just to take a look
at in terms of what we can do,
with these sorts of networks,
imagine what Google Translate is doing.
So what Google Translate is doing is
it's taking some text written in one
language and converting it into
text written in some other language,
for example, where now this
input is a sequence of data--
it's a sequence of words--
and the output is a
sequence of words as well.
It's also a sequence.
So here, we want effectively
like a many-to-many relationship.
Our input is a sequence, and our
output is a sequence as well.
And it's not quite going to work to
just say, take each word in the input
and translate it into
a word in the output,
because ultimately, different languages
put their words in different orders,
and maybe one language uses
two words for something,
whereas another language only uses one.
So we really want some way to take
this information-- that's input--
encode it somehow, and use that encoding
to generate what the output ultimately
should be.
And one of the big advancements
in automated translation
technology has been the ability
to use neural networks to do this, instead
of older, more traditional methods,
and this has improved
accuracy dramatically.
And the way you might
imagine doing this is, again,
using a recurrent neural network with
multiple inputs and multiple outputs.
We start by passing in all the input.
Input goes into the network.
Another input, like another
word, goes into the network,
and we do this multiple times, like
once for each word in the input
that I'm trying to translate.
And only after all of that
is done, does the network now
start to generate output, like the
first word of the translated sentence,
and the next word of the translated
sentence, so on and so forth,
where each time the
network passes information
to itself, giving some sort of state
from one run of the
network to the next run,
assembling information
about all of the inputs,
and then passing along information about
which part of the output it needs
to generate next.
And there are a number of
different types of these sorts
of recurrent neural networks.
One of the most popular is known as
the long short-term memory neural
network, otherwise known as LSTM.
But in general, these types of
networks can be very, very powerful
whenever we're dealing with
sequences, whether those
are sequences of images or especially
sequences of words when it comes
towards dealing with natural language.
So those then were just some of the
different types of neural networks
that can be used to do all
sorts of different computations,
and these are incredibly
versatile tools that
can be applied to a number
of different domains.
We only looked at a couple of the most
popular types of neural networks--
the more traditional
feed-forward neural networks,
convolutional neural networks,
and recurrent neural networks.
But there are other types as well.
There are adversarial networks, where
networks compete with each other
to try and be able to
generate new types of data,
as well as other networks that can solve
other tasks based on what they happen
to be structured and adapted for.
And these are very powerful
tools in machine learning,
from being able to very easily learn
based on some set of input data
and to be able to therefore
figure out how to calculate
some function, from inputs to outputs.
Whether it's input to some sort of
classification, like analyzing an image
and getting a digit, or
machine translation where
the input is in one language
and the output is in another,
these tools have a lot of applications
for machine learning more generally.
Next time, we'll look at
machine learning and AI
in particular in the
context of natural language.
We talked a little bit about
this today, but looking
at how it is that our AI can begin
to understand natural language
and can begin to be able to
analyze and do useful tasks with
regards to human
language, which turns out
to be a challenging
and interesting task.
So we'll see you next time.
