Hello, world. It's Siraj, and today we're going to generate words. Given some book or movie script or any kind of text corpus, it's plug and play: you can give it any kind of text corpus and it will learn how to generate words in the style of that corpus. In this case we're going to give it a book, Metamorphosis by Franz Kafka, who was a really crazy, weird writer from the 20th century. Anyway, he's a cool dude. We're going to generate words in the style of that book, and this can be applied to any type of text. It doesn't just have to be words; it can be code, it can be HTML, whatever. No libraries, just numpy, so I'm going to go through the derivation, the forward propagation, calculating the loss, all the math. Get ready for some math: put on your linear algebra and calculus hats, okay?
So this is kind of what it looks like, this first image here. And I'm going to actually code it as well; I'm not just going to glaze over it. I'm going to code the important parts so we can see the outputs as I go. Okay, so check it out: given some text corpus, it will predict the next character. What you're seeing here is it predicting the next word, but we're going to build a character-level recurrent network, so it's going to generate text character by character by character, not word by word by word. It's going to be trained for a thousand iterations, and the more you train it, the better it's going to get. If you leave this thing running overnight on your laptop, then by the time you wake up it'll be really good. However, I wouldn't recommend training it on your laptop; as my song says, I train my models in the cloud now, because my laptop takes longer. So what is a recurrent network?
What is this thing? We've talked about feed-forward networks; I've got two images here of feed-forward networks. The first image is the most popular one: that really funky-looking neuronal architecture. But it can be kind of confusing if you think about it, because it's not like these neurons are classes and each class has links to all of the other neurons, some kind of massive, crazy linked-list or tree-like thing. It's not really like that. What's really happening is a series of matrix operations. These neurons are actually just numbers that we then activate with an activation function. So a better, more mathematically sound way of looking at it is as a computation graph.
If you have some inputs, and the input could be anything, what you do is multiply the input by the weight matrix, add a bias value, and then activate the result. That's your output, which you then feed into the next layer. A layer, what you see as these neurons, is actually just the result of a dot product operation followed by adding a bias value. You should add a bias in practice; I've built neural networks without biases before for examples, but you really should add one, and I'll talk about why in a second. Then you activate the output of that. By activate, I mean you take the output of that dot-product-plus-bias operation and feed it into an activation function, a non-linearity, whether that's a sigmoid, tanh, or rectified linear unit. The reason we do that is so our network can learn both linear and nonlinear functions; neural networks are universal function approximators, and if we didn't apply an activation function, they would only be able to learn linear functions. We want to learn both, and that's why we apply an activation function. A great way to remember this whole thing is to just rap about it: input times weight, add a bias, activate, repeat. Here we go, sing with me: input times weight, add a bias, activate, repeat. You just do that for every layer; you just repeat that process.
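As a quick sketch of that chant, here is a hypothetical two-layer feed-forward pass with made-up sizes (3 inputs, 4 hidden units, 2 outputs); the sizes and tanh choice are illustrative assumptions, not the network we build later:

```python
import numpy as np

def layer(x, W, b):
    # input times weight, add a bias, activate (tanh as the non-linearity)
    return np.tanh(np.dot(W, x) + b)

np.random.seed(0)
# hypothetical sizes: 3 inputs -> 4 hidden units -> 2 outputs
W1, b1 = np.random.randn(4, 3) * 0.01, np.zeros((4, 1))
W2, b2 = np.random.randn(2, 4) * 0.01, np.zeros((2, 1))

x = np.ones((3, 1))      # some input
h = layer(x, W1, b1)     # input times weight, add a bias, activate...
y = layer(h, W2, b2)     # ...repeat for the next layer
```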
Okay, so feed-forward networks: they're great for learning an input-output pattern. What is the rule between a set of inputs and a set of outputs? In the end, a feed-forward network, and in fact every neural network, is just one big composite function. What do I mean by that? You can think of a neural network as a giant function, and inside of that network are smaller, nested functions. A composite function is a function that consists of other functions. What do I mean by nested functions? Remember this computation graph we just looked at: each layer is a function. Input times weights, add a bias, activate: that is a function whose output you feed as the input to the next function. The most nested function, right in the middle, is the first layer, whose output we feed to the next layer, which is the next function, whose output we feed to the next layer. The outermost function is the output layer, because we're feeding it the output of all that chain of computation that already occurred. So that's what a neural network is: a composite function.
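A toy scalar version of that nesting, with made-up functions standing in for the layers, looks like this:

```python
def h(x):            # the most nested function: the first layer
    return 3 * x + 1

def g(x):            # the next function: the next layer
    return x ** 5

def f(x):            # the whole network: a composite of the two
    return g(h(x))

print(f(1))          # g(h(1)) = g(4) = 1024
```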
We'd use feed-forward nets any time we have two variables that are related: temperature and location, height and weight, car speed and brand. These are all mappings. But what if the ordering of the data mattered? (That was me adding my own sound effect.) What if you had stock prices? It's a very controversial topic, but the stock price thing gets a lot of views. I don't personally care much about finance data, but I know some of you do, and I'll probably talk about it more in the future. Anyway, tangent over, back to this: what if time matters? Stock prices are a great example of when time matters. You can't just predict a mapping between time and the price; what the stock prices were before is what matters to the current stock price, at least with the data that's actually available in the context of stock prices. And it applies to all time-series data. Take video: if you want to generate the next frame in a video, it matters what frames came before it. You can't just learn a mapping between a frame and the time that frame shows up, because then, given some new time, you'd be generating a frame based on nothing else. It depends on the frames that came before it. You see what I'm saying? The sequence matters. Same for the alphabet, or the lyrics of a song: you can't just generate a lyric or a letter from the index it's at; you've got to know what came before it. To get into neuroscience for a second, try to recite the alphabet backwards. It's hard, right? Z, Y, X, W... okay, see, I can't even do it right now, and I'm not going to edit that out. Or try to recite any song backwards. You can't, because you learned it as a sequence. It's a kind of conditional memory: what you remember depends on what you've stored previously. And that is what recurrent networks help us do: they help us compute conditional memories; they help us compute the next value in a sequence of values.
That's what recurrent networks are good at; that's what they're made for. And it's not like this is some new technology. Recurrent networks were invented in the 80s, and neural networks were invented in the 50s. So why is this super hot right now? Why are you watching this video? Because with the arrival of bigger data and bigger computing power, when you give recurrent networks those two things, they blow almost every other machine learning model out of the water in terms of accuracy. It's just incredible.
Anyway, this is a picture of a three-layer recurrent network. You've got your input layer, your hidden state, and your output layer, and that alone would just be a feed-forward network. But the difference here is this other piece. It's not actually another layer; the difference is that we've added a third weight matrix. We've got our first weight matrix, our second weight matrix, and now a third weight matrix, and that's really what makes it different from a feed-forward network. What that third weight matrix is doing is connecting the hidden state at the current time step to the hidden state at the previous time step. It's the recurrent weight matrix, and you'll see programmatically and mathematically what I'm talking about, but that's really the key bit for recurrent networks; that's what makes them unique compared to feed-forward networks.
So what does this do? Whenever we feed in some value, because we are training this network, we continuously feed it new data points, data point after data point from our training set. For feed-forward networks we only feed in the input; we don't feed in the previous hidden state. We feed in input after input after input, and the hidden state gets updated at every time step. But because we want to remember a sequence of data, we're going to feed in not just the current data point but also the previous hidden state, and by that I mean the values computed for the hidden state at the previous time step, that set of numbers, that matrix. Now you might be thinking: wait a second, why don't we just feed in the input plus the previous input from the previous time step? Why are we feeding in the input plus the previous hidden state? Because input recurrence only remembers what just happened, that one previous input. If you feed in the previous hidden state, the network can remember the whole sequence. It's not just about what came right before; it can remember everything that came before it, because you can think of that hidden state as a kind of clay that's being molded by every new input. By feeding that clay back into the network, it's learning a form of neural memory, conditional memory, and it can remember sequential data.
So here's another example, just to give a few more before I go into the code. This is a very popular type of image for recurrent networks. What's happening is we're feeding in the current input, calculating a hidden state, and then computing an output. Then for the next time step we give it the new data point as well, and the blue arrow is what's different here compared to a feed-forward network: we're feeding in the previous hidden state along with the input to compute the current hidden state, which we use to compute our output, our y value, and we're using a loss function to improve our network every time. If you think of what that recurrence looks like, it looks like this. Remember that feed-forward network we just looked at? The difference here is that we are feeding the output of the hidden state, the output of that input-times-weight-plus-bias, activate operation in a layer, back into the input.
The formula for a recurrent network looks like this. It basically says that the current hidden state h_t is a function of the previous hidden state and the current input, and the theta values right here are the parameters of the function. The network learns to use h_t as a lossy summary of the task-relevant aspects of the past sequence of inputs up to time t.
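As a minimal sketch of that formula, h_t = f(h_(t-1), x_t; theta), with made-up sizes and random matrices standing in for the learned parameters theta:

```python
import numpy as np

def f(h_prev, x, theta):
    # theta bundles the parameters: the recurrent and input weight matrices
    Whh, Wxh = theta
    return np.tanh(np.dot(Whh, h_prev) + np.dot(Wxh, x))

np.random.seed(0)
theta = (np.random.randn(4, 4) * 0.01, np.random.randn(4, 3) * 0.01)

h = np.zeros((4, 1))                  # initial hidden state
for x in [np.ones((3, 1))] * 5:       # a short sequence of (identical) inputs
    h = f(h, x, theta)                # h becomes a lossy summary of the sequence so far
```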
The loss function we're going to use here is the negative log likelihood. It's a very popular loss function for recurrent networks, and for plain old recurrent networks like ours, not using anything fancy like long short-term memory cells or bidirectional capabilities, it usually gives the best accuracy, which is why we're going to use it. We'll talk about what it consists of in a second.
Our steps are: first, initialize our weights randomly, like we always do. Then we give the model a char pair. What's a char pair? It's the input char, some letter from the training text that we give as input, together with the target char, and the target char is our label. Our label is the next char. So if we take the first two chars from some input text from some corpus, say the word is "the": the input char would be "t" and the target char would be "h". Given "t", we want to predict "h". You see how that target char acts as our label, the thing we're trying to predict. In our forward pass, we calculate the probability for every possible next char given that "t", according to the state of the model, using the parameters. Then we measure our error as the distance between the predicted probability values and the target char, our label, the next char in the sequence, and we just keep doing that. Once we have that error value, we use it to calculate the gradients for each of our parameters, to see the impact they have on the loss, and that is backpropagation through time. We call it "through time" because we are applying that hidden-state-to-hidden-state matrix, that recurrent matrix, but otherwise it's just plain backpropagation. Then, once we have our gradient values, we update all the parameters in the direction of our gradients to minimize the loss, and we just keep repeating that process. Everything else is the same as a feed-forward network: gradient descent, calculating an error value, a forward pass. The difference is that we are connecting the current hidden state to the previous hidden state, and that changes how the network learns.
So what are some use cases? I talked about time-series prediction: weather forecasting, stock prices, traffic volume. Sequential data generation as well: music, video, audio, any kind of sequential data. What is the next note, the next audio waveform, the next frame in the video? For other examples, I've got a great one here on binary addition that was originally written by Andrew Trask (iamtrask), who is a great technical writer; definitely check that out. Once we understand the intuition behind recurrent networks, we can move on to LSTM networks, bidirectional networks, and recursive networks. Those are more advanced networks that solve some problems with recurrent networks, but before you get there, you've got to understand recurrent networks, okay?
So this code contains four parts. The first part is loading the training data. Then we'll define our network. Then we'll define our loss function, and the loss function is going to contain both the forward pass and the backward pass, so the real meat of the code happens in the loss function: it returns the gradient values that we can then use to update our weights later on during training. Once we've defined that, we'll write a function to make predictions, which in this case means generating words, and we'll train the network as well.
Okay, so our first step is to load up our training data. Let's define what that data is. If we open this file and look at it, kafka.txt, it says "One morning, when Gregor Samsa woke up from troubled dreams...", right? So it's just a book, a big txt file; that's what the input is going to be. We'll open it up using Python's native functions and read that simple plain txt file. Then we'll get the list of unique characters, store it in chars, and define how big our data is as well as our vocab size: the length of the data, that big text file, as well as the length of chars, how many unique chars we have. We'll print it out for ourselves just so we know how many chars there are, and it's going to tell us how many unique chars there are, which matters to us because we want to make a vector of the size of the number of chars. So let me go ahead and print that out, and it tells us exactly what the deal is: the data has 137K characters, and 81 of them are unique. Okay, good.
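A sketch of that loading step; in the video the text comes from the Kafka book on disk, but here a short inline string stands in so the snippet is self-contained:

```python
# stand-in for: data = open('kafka.txt', 'r').read()
data = "One morning, when Gregor Samsa woke from troubled dreams"
chars = list(set(data))                    # the unique characters in the corpus
data_size, vocab_size = len(data), len(chars)
print('data has %d characters, %d unique' % (data_size, vocab_size))
```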
Good to know. Our next step is to calculate the vocab size. We're going to calculate the vocab size because we want to be able to feed vectors into our network. We can't just feed in raw strings, raw chars; we've got to convert the chars to vectors, because a vector, in the context of machine learning, is an array of float values, a list of numbers. The vocab size helps us do this. We're going to create two dictionaries, and these two dictionaries convert characters to integers and integers to characters, respectively. One converts from character to integer, which is the one I've just written, and the next converts integers to characters. Once we've done that, we can print all of the values they're storing, because these are the dictionaries we're going to use in a second to convert our values into vectors. So let's go ahead and print that. What's the deal here? Oh, enumerate, right. Okay, great. So here are our dictionaries: one from characters to integers and one from integers to characters.
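Those two dictionaries can be built with enumerate; here a tiny stand-in character list replaces the book's 81 unique chars:

```python
chars = ['a', 'b', 'c']                              # stand-in for the book's 81 unique chars
char_to_ix = {ch: i for i, ch in enumerate(chars)}   # character -> integer
ix_to_char = {i: ch for i, ch in enumerate(chars)}   # integer -> character
print(char_to_ix)   # {'a': 0, 'b': 1, 'c': 2}
print(ix_to_char)   # {0: 'a', 1: 'b', 2: 'c'}
```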
So once we have that, we're going to create a vector for the character "a". This is what vectorization looks like for us. We'll initialize the vector as empty, just a vector of zeros of the size of the vocab. Then we do the conversion: char to integer, so "a" to its integer. That integer gives us the index, and we set that element to one. So what happens when we print out this vector? Let's see if I got that right. Hold on: import numpy, I forgot to import numpy. Okay. Right, so it's a vector of size equal to how many unique characters there were: 81. All of the elements in the vector are zero, except for the one at the index that "a" maps to in that dictionary. That's how we mapped it; that's why we created those two dictionaries. This is what we would feed in as "a", and we'll feed in two of these, because remember I said we have a char pair: we'll feed in "a" and whatever the next character is. That's our input and our label value, and the label is the other character, the next character.
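The vectorization step for "a" then looks roughly like this, under the assumption that "a" happens to map to index 0 in the dictionary:

```python
import numpy as np

vocab_size = 81                # 81 unique characters in the book
char_to_ix = {'a': 0}          # assumption: suppose 'a' maps to index 0

vector_for_char_a = np.zeros((vocab_size, 1))   # start with all zeros...
vector_for_char_a[char_to_ix['a']] = 1          # ...and set a 1 at a's index
```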
Okay, so now for our model parameters: we're going to define our network. Remember, it's a three-layer network: we have our input layer, our hidden layer, and our output layer, and all these layers are fully connected, meaning every value in one layer is connected to every value in the next. First, though, let's define our hyperparameters, the tuning knobs for the network. We'll say our network has a hundred neurons for its hidden layer, that 25 characters are processed at every step (that's our sequence length), and that our learning rate is a very small number, because if it's too low it will take forever to converge, but if it's too high it will overshoot and never converge. The learning rate, by the way, is how quickly a network abandons old beliefs for new ones. If you're training your neural network on cat and dog images, and you've just been training it on cat pictures, then the lower the learning rate, the more likely it will consider a new dog picture just an anomaly and kind of discard it internally; the higher the learning rate, the more quickly it will treat that dog picture as part of what it should learn. So it's a way to tune how quickly the network abandons old beliefs for new ones. Anyway, those are our hyperparameters.
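In code, those three tuning knobs are just three assignments; 0.1 here is a guess at the "very small number" on screen, the value min-char-rnn-style code commonly uses:

```python
hidden_size = 100      # number of neurons in the hidden layer
seq_length = 25        # characters processed per step: the sequence length
learning_rate = 1e-1   # small enough to converge without overshooting
```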
Now we'll define our model parameters, our network's weight values. The first set of weights goes from our input x to our hidden state, so it's called Wxh; that's the terminology. It's initialized randomly using numpy's random.randn function, with dimensions given by the hidden size we've defined and the vocab size, because those are the two sizes we're dealing with here, and we multiply it by 0.01 because we just want to scale it down; it's a character-level recurrent network. So that's input to hidden state. Then we repeat that process, but this time not from input to hidden but from the hidden state to the next hidden state, and that's our recurrent weight matrix right there. Lastly, we have our third weight matrix, from the hidden state to our output, and its shape is between the vocab size and the hidden size. Then, since we have two biases, we'll say the bias for the hidden state is initialized as a set of zeros of size hidden size, because it's for our hidden state; that's our hidden bias. And one more bias, the output bias, also a collection of zeros; the difference is that it's of size vocab size.
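Putting that parameter setup together as a sketch, with the sizes from above:

```python
import numpy as np

hidden_size, vocab_size = 100, 81   # from the hyperparameters and the data

np.random.seed(0)
# three weight matrices: random values, scaled down by 0.01
Wxh = np.random.randn(hidden_size, vocab_size) * 0.01   # input to hidden
Whh = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden to hidden: the recurrent one
Why = np.random.randn(vocab_size, hidden_size) * 0.01   # hidden to output

# two bias vectors, initialized to zeros
bh = np.zeros((hidden_size, 1))   # hidden bias
by = np.zeros((vocab_size, 1))    # output bias
```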
Okay, and... let's see what we got here. "hidden_size is not defined"? It's right here. Compile, compile. Still "hidden_size is not defined"? Yes, it is. Yes, it is. Invalid syntax... okay, great.
So the loss function is going to take as its input a list of input chars, a list of target chars, and the previous hidden state, and it's going to output a loss, a gradient for each parameter between layers, and the last hidden state. So what is the forward pass? The forward pass in a recurrent network looks like this; this function describes how the hidden state is calculated. Remember, it's just a series of matrix operations, and this is basically our forward pass right here. Let me make this smaller so you can see. The way we compute this operation is: the dot product between the input-to-hidden-state weight matrix and the input data, that's this first term, plus the dot product between the hidden-state-to-hidden-state matrix and the previous hidden state, and then we add the hidden bias. That gives us the hidden state value at the current time step. Then we take that value and compute a dot product with the next weight matrix, hidden state to output, and add the output bias value, and that gives us the unnormalized log probabilities for the next chars, which we then squash into probability values using this function p, which is right here. I'll talk about that in a second.
So that's our forward pass. Before I talk about the backward pass, let's talk about the loss for a second. The loss is the negative log likelihood, so it's the negative log value of p, and p is this function here, represented programmatically by this right here. It's e to the x, where x is the output value it received, divided by the sum of e to the x over all the output values. That gives us p, a probability value. We take that probability and then take its negative log, and that is our loss, a scalar value. Once we have that loss, we're going to perform backpropagation using it.
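Numerically, that loss step looks like this for a made-up four-character vocab; the raw output values here are arbitrary illustration numbers:

```python
import numpy as np

# hypothetical unnormalized log probabilities (the network's raw outputs)
y = np.array([[2.0], [1.0], [0.1], [-1.0]])

p = np.exp(y) / np.sum(np.exp(y))   # squash into probabilities: softmax
target = 0                          # index of the true next char
loss = -np.log(p[target, 0])        # negative log likelihood: a scalar
```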
The way we compute backpropagation, to go over this, is by using the chain rule from calculus. What we want to do is compute a gradient for each of the layers, for each of the weight matrices. Given an error value, we compute the partial derivative of the error with respect to each weight, recursively. The reason we're using the chain rule is that we have three weight matrices, input-to-hidden, hidden-to-output, and hidden-to-hidden, and we want to compute gradient values for all three of them. That's what this looks like: we compute our loss using the negative log likelihood and use that loss to compute the partial derivative with respect to each of these weight matrices. Once we have those, that's our gradient value, the change, the delta, and we can update all three weight matrices at once, and we just keep doing that over and over again. The first gradient of our loss is computed using this function: take p and subtract one at the target index. That gives us our first gradient, and we use the chain rule to backward-pass that gradient into each weight matrix.
Let me talk about what I mean by this. Remember how I said neural networks are giant composite functions? The chain rule lets us compute the derivative of a composite function as the product of the derivatives of its nested functions. In the case of f(x) right here: if f(x) is a composite function that consists of g(h(x)), then the chain rule says its derivative is the derivative of g evaluated at h(x), times the derivative of the nested function h(x). You multiply by the derivative of the inside function, and that gives you the derivative of the bigger function, and you keep doing that for as many nested functions as you have. Here's another example: if I want to differentiate (3x + 1) to the fifth power, this is a composite function where the outer function g is "raise to the fifth power" and the inner function is 3x + 1. Using the power rule, we take the exponent, move it to the coefficient, and subtract one from the exponent, so it becomes 5 times (3x + 1) to the fourth, times the derivative of the nested function 3x + 1, which is 3. That's the chain rule, and multiplying those two derivatives together gives us the derivative of the larger function f(x).
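We can sanity-check that chain-rule derivative numerically with a centered finite difference:

```python
def f(x):        # f(x) = (3x + 1)^5, the composite function
    return (3 * x + 1) ** 5

def df(x):       # chain rule: 5 * (3x + 1)^4, times the inner derivative 3
    return 5 * (3 * x + 1) ** 4 * 3

# compare against a centered finite difference at x = 1
x, eps = 1.0, 1e-6
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)
print(df(x), numeric)   # both close to 5 * 4^4 * 3 = 3840
```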
That same logic applies to neural networks, because neural networks are composite functions. We recursively move this partial derivative backward, and by moving I mean computing the dot product between the partial derivative calculated at the later layer and the values at each layer, going backward recursively. This will make more sense as we look at it programmatically, but that's what's happening here.
So let's code this out. By the way, the reason we add a bias is that it allows you to move the line. Think of it like the b in the equation y = mx + b: it lets you move the line up and down to better fit the data. Without b, the line will always go through the origin (0, 0) and you might get a poorer fit. So a bias is kind of like an anchor value.
Anyway, to define our loss function: it's going to take our inputs and our targets as its parameters, as well as the hidden state from the previous time step. Then let's define the containers we're going to store these values in. I'm going to define four of them; these are empty dictionaries that will store values at every time step as we compute them. xs will store the one-hot encoded input characters for each of the 25 time steps. hs will store the hidden state outputs. ys will store the output values, and ps will take the ys and convert them to normalized probabilities for chars. Then hs at index -1 gets initialized with a copy of the previous hidden state. The reason we're copying, check this out: using the equals sign would just create a reference, but we want a whole separate array. We don't want hs[-1] to automatically change if h_prev changes, so we create an entirely new copy of it.
And so then we'll initialize our loss as 0 and then and then okay
So we'll initialize our loss as 0 so this is our loss scalar value, and then we'll go ahead and do the forward pass
So the forward pass is going to look like this. OK, so we've already looked at it
mathematically, and now we can look at it programmatically.
So we'll say, OK, for each step t in the range of the length of the inputs,
let's compute a forward pass. So the forward pass is going to be:
we're going to start off with that 1-of-k representation.
We place a zero vector as the t-th input, and then inside that t-th input
we use the integer in the inputs list to set the correct element to 1.
OK, so that's that in that second line.
And then once we have that, we're going to compute the hidden state. Now remember,
I showed you the equation before; we just repeat that equation here.
And then we compute our output just like I showed before, and then the probabilities for the next chars.
Once we have our probabilities, we'll compute our softmax cross-entropy loss, which is the negative log likelihood; it's also called the cross-
entropy. You'll actually see the cross-entropy in TensorFlow as a
predefined function, but we're computing it by hand here. And so once we have the forward pass, now we can compute the backward pass.
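Putting the whole forward pass together, here's a minimal runnable sketch. The weight names (Wxh, Whh, Why, bh, by) and the dictionary layout mirror the code walked through above, but the sizes, the random initialization, and the toy index sequences are assumptions for illustration:

```python
import numpy as np

vocab_size, hidden_size = 5, 4
rng = np.random.default_rng(0)

# toy parameters (in the real code these are the learned weight matrices)
Wxh = rng.standard_normal((hidden_size, vocab_size)) * 0.01   # input to hidden
Whh = rng.standard_normal((hidden_size, hidden_size)) * 0.01  # hidden to hidden
Why = rng.standard_normal((vocab_size, hidden_size)) * 0.01   # hidden to output
bh = np.zeros((hidden_size, 1))
by = np.zeros((vocab_size, 1))

inputs, targets = [0, 1, 2], [1, 2, 3]   # toy character-index sequences
hprev = np.zeros((hidden_size, 1))

xs, hs, ys, ps = {}, {}, {}, {}
hs[-1] = np.copy(hprev)
loss = 0
for t in range(len(inputs)):
    xs[t] = np.zeros((vocab_size, 1))                 # 1-of-k encoding
    xs[t][inputs[t]] = 1
    hs[t] = np.tanh(Wxh @ xs[t] + Whh @ hs[t - 1] + bh)  # hidden state
    ys[t] = Why @ hs[t] + by                          # unnormalized output
    ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t]))     # softmax probabilities
    loss += -np.log(ps[t][targets[t], 0])             # cross-entropy loss
```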
We're going to compute the gradient values going backwards.
So we initialize empty arrays for these gradient values, right? So the gradients,
these are the gradients, i.e. the derivatives; the derivatives are our gradients, it's the same thing here.
So we're computing our
derivatives with respect to our weight values: from x to h, from h to h, and then from h to y,
and we'll initialize them as zeros. And then we also want to compute partial derivatives, or
gradients, for the
bias values for our hidden state and our output,
as well as for the hidden state at the next time step.
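Initializing those gradient buffers can be sketched like this; np.zeros_like gives a zero array with the same shape as each parameter (the shapes here are toy assumptions):

```python
import numpy as np

hidden_size, vocab_size = 4, 5
Wxh = np.ones((hidden_size, vocab_size))
Whh = np.ones((hidden_size, hidden_size))
Why = np.ones((vocab_size, hidden_size))
bh, by = np.zeros((hidden_size, 1)), np.zeros((vocab_size, 1))

# one zero-filled gradient buffer per parameter, same shape as the parameter
dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
dbh, dby = np.zeros_like(bh), np.zeros_like(by)
dhnext = np.zeros((hidden_size, 1))   # gradient arriving from the next time step
```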
We need derivatives for all of them. When we do backpropagation, we're going to collect our output probabilities and then
derive our first gradient value. Now our first gradient value, it looks like this, let me go back up here:
this is how we compute our first gradient value, the derivative of the loss
with respect to our output. All right, that's the first gradient value.
So
we're going to compute the output gradient,
which is that output gradient times the hidden state's transpose. And we can think of this, so check this out right here:
this is our first partial derivative for our hidden-state-to-y, our output
layer, that matrix. And so what we do is we compute the dot product between that output gradient and the
transpose of the hidden state. The reason we use a transpose is that we can think of this
intuitively as moving the error backward through the network, giving us some sort of measure of the error at the output of that layer.
So when we compute the dot product between the transpose of some layer's matrix and the derivative of the next layer, that
is moving the error backwards. It's backpropagation, because the error value
is constantly changing with respect to every layer that it moves through, and by taking the dot product of the
partial derivative from the previous layer with the
transpose of where we currently are, it's going to output a gradient value, that derivative, and
we'll use that derivative
later on to update other values as well.
We're also going to compute the derivative of the output bias, and then we're going to backpropagate into h.
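The output-layer gradients just described can be sketched as follows; ps_t, h_t, and the target index are made-up stand-ins for the softmax output, hidden state, and correct next character at one time step:

```python
import numpy as np

vocab_size, hidden_size = 5, 4
ps_t = np.full((vocab_size, 1), 0.2)    # toy softmax probabilities at step t
target = 2                              # toy index of the correct next char
h_t = np.full((hidden_size, 1), 0.5)    # toy hidden state at step t

dy = np.copy(ps_t)
dy[target] -= 1        # dL/dy for softmax + cross-entropy: prob minus 1 at target
dWhy = dy @ h_t.T      # outer product (vocab, 1) x (1, hidden): error moved back
dby = dy               # the output-bias gradient is just dy itself
```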
So
notice how we are continuously performing dot product operations here, for every single layer we have.
We're also backpropagating through the tanh
non-linearity, right? So we are computing its derivative,
and this is programmatically what the derivative of tanh looks like. And we're using the
derivatives computed at the tail end of the network as we move through
toward the beginning,
using them as values to compute the dot products. The whole point of computing the dot product
with respect to each of these layers is that we are computing new
gradient values that we can then use to update our network later on. So then we use that
raw value to compute the remaining gradients.
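That tanh backprop step can be sketched like this; h_t and dh are invented toy values:

```python
import numpy as np

h_t = np.tanh(np.array([[-1.0], [0.0], [0.5], [2.0]]))  # h_t is already tanh(z)
dh = np.ones_like(h_t)        # toy gradient arriving from the layer above

# tanh'(z) = 1 - tanh(z)^2, so the derivative needs only the stored activation,
# not the pre-activation z itself
dhraw = (1 - h_t * h_t) * dh
```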
And then lastly we compute the derivative for the input-to-hidden weights, as well as the
derivative for the hidden-to-hidden weights, and
once we have those, we can return all of those derivatives, our gradient values, our change values.
We can return all of that now.
There's also this step right here to mitigate exploding gradients, which we're not going to go into right now
because it's not really necessary. However, I will say this:
whenever you have really, really long sequences of input data,
like the Bible, just a huge book, then what happens is, as the gradient is moving backward,
I mean, as you're computing the dot product of it for every layer with the current weight matrix, wherever you're at, using the partial derivative,
the gradient value can blow up, or it can get smaller and smaller.
The shrinking case is a well-known problem with recurrent networks called the vanishing gradient problem,
and the blowing-up case is the exploding gradient problem. One way to prevent exploding gradients is to clip
the values, by defining some interval that they're allowed to reach, and
a way to fight vanishing gradients is to use LSTM networks, which we're not going to talk about here. But anyway.
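Gradient clipping itself is one line of numpy; the clipping interval of [-5, 5] here is an assumed threshold, not a universal constant:

```python
import numpy as np

dWxh = np.array([[ 7.3, -0.2],
                 [-9.1,  4.8]])       # a toy gradient with a couple of huge entries

np.clip(dWxh, -5, 5, out=dWxh)        # clamp every entry into [-5, 5], in place
print(dWxh)                           # [[ 5.  -0.2] [-5.   4.8]]
```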
Yeah, so those are our forward and backward
passes. We computed them inside of the loss function, and we compute our loss as well, right here, using softmax cross-entropy.
Now for sampling: for as many characters as we want to generate, we will do this. We'll say:
the forward pass is just like we did before, the same exact thing; it's just repeating the code over and over again.
Input times weights, activate, repeat, get the probability values, sample a character index from those probabilities,
create a one-hot vector for that character,
set it for the predicted char, and then add it to the list. And we just keep repeating that; n
defines how many characters we want to generate, so we can generate as many characters as we want on a trained network, and
we'll print those out.
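A sketch of that sampling loop follows. The untrained toy weights below produce gibberish indices, but the mechanics match the forward step described above. Note this sketch samples from the probability distribution rather than always taking the most likely character, which keeps the generated text varied:

```python
import numpy as np

vocab_size, hidden_size = 5, 4
rng = np.random.default_rng(1)
Wxh = rng.standard_normal((hidden_size, vocab_size)) * 0.01
Whh = rng.standard_normal((hidden_size, hidden_size)) * 0.01
Why = rng.standard_normal((vocab_size, hidden_size)) * 0.01
bh, by = np.zeros((hidden_size, 1)), np.zeros((vocab_size, 1))

def sample(h, seed_ix, n):
    """Generate n character indices, starting from seed_ix and hidden state h."""
    x = np.zeros((vocab_size, 1))
    x[seed_ix] = 1
    ixes = []
    for _ in range(n):
        h = np.tanh(Wxh @ x + Whh @ h + bh)   # same forward step as in training
        y = Why @ h + by
        p = np.exp(y) / np.sum(np.exp(y))     # softmax probabilities
        ix = rng.choice(vocab_size, p=p.ravel())  # sample an index from p
        x = np.zeros((vocab_size, 1))         # feed the chosen char back in
        x[ix] = 1
        ixes.append(int(ix))
    return ixes
```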
OK, so then for the training part. We've really completed
that code, right? But now for the training part,
we're going to feed the network some portion of the file, and then for the loss function
we're going to do a forward pass to calculate all those values of the model for a given input,
for a given input-output pair (the current char and the next char), and we're going to do a backward pass to calculate all those gradient
values.
And then we're going to update the model using a technique, a type of gradient descent technique,
called AdaGrad, which just decays the learning rate, but it's still gradient descent.
You'll see what I'm talking about.
It's not complicated,
but it's called AdaGrad.
So we're going to create huge arrays of chars from the data file. The target one is going to be shifted from the input one,
so basically just shifted by one, as you notice here.
So now we have our inputs and our targets, right? And these numbers are actually character indices from the dictionary, and they
help us create one-hot vectors,
where the index here marks the single one among all the zeros in the zero vector, and that's what we feed into our model.
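Concretely, the shifted input/target pairing can be sketched like this; the string, the sequence length, and the position are made-up toy values:

```python
data = "hello world"
chars = sorted(set(data))
char_to_ix = {ch: i for i, ch in enumerate(chars)}   # char -> integer index

seq_length, p = 4, 0
inputs  = [char_to_ix[ch] for ch in data[p:p + seq_length]]          # "hell"
targets = [char_to_ix[ch] for ch in data[p + 1:p + seq_length + 1]]  # "ello"

# the targets are just the inputs shifted one character to the right:
# for each current char, the model's job is to predict the next one
print(inputs, targets)
```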
So AdaGrad is our gradient descent technique, and the difference here,
as compared to regular gradient descent, is that we decay the learning rate over time, and what this does is it helps our network
learn more efficiently. This is the
equation for AdaGrad, where step size means the same thing as learning rate. But basically,
the learning rate gets smaller and smaller during training, because we introduce this memory variable that grows over time to calculate the step size.
And the reason the step size decreases while the memory grows is that the memory sits inside the denominator of this function right here. This is a programmatic
representation of the mathematical equation that you're looking at right here.
So here's the programmatic implementation of that: we calculate this memory value,
which is the running sum of the squared gradients of our parameters, and then we update our weight matrix,
scaled by the learning rate, which decays over time via this function right here.
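Here's a toy AdaGrad update showing the memory growing while the effective step shrinks; the parameter, gradient, and learning rate are invented for illustration:

```python
import numpy as np

learning_rate = 1e-1
param = np.array([1.0, 1.0])       # a toy parameter vector
mem = np.zeros_like(param)         # AdaGrad memory: running sum of squared grads

for step in range(3):
    dparam = np.array([0.5, 2.0])  # pretend gradient (held constant to show decay)
    mem += dparam * dparam                                   # memory only grows...
    param += -learning_rate * dparam / np.sqrt(mem + 1e-8)   # ...so steps shrink
```

The 1e-8 in the denominator just avoids division by zero on the first step.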
So finally, this is really it: we've done all the math, and now it's just implementing it.
So we have our weight matrices here,
we have our memory variables for AdaGrad, and then we will say, for a thousand iterations,
well, actually a thousand times 100 iterations, we want to feed the loss function
the input vectors. To see how this part works:
we're going to feed the loss function our input vectors,
and then it's going to compute a forward pass, and it's going to compute the loss as well.
It's going to return the loss scalar, and it's going to return the derivatives, or
gradients, with respect to all of those weight values that we want to update, and then we're going to
perform the parameter update using AdaGrad. Right, so we'll feed all those derivative values
to our AdaGrad update function right here.
OK, and it's going to update our parameters, and basically the learning rate just decays over time.
That's why the memory variable is calculated: to decay the learning rate over time,
which just helps with convergence. And there are different gradient descent techniques, Adam and different ones like that,
momentum, but yeah, AdaGrad is one of them. And so once we do that,
we can look at our sample function here. And our sample function, right here,
is going to generate 200-character
samples at a time, for a thousand times 100 iterations. So a lot of iterations: 100,000 iterations.
OK, so let's go ahead and run this and see what happens.
OK, see, the first iteration is really bad. Look at that,
it's just, like, weird characters. OK, but now it's got more human-readable characters. OK, it's getting better now.
And notice how the loss is decreasing very rapidly here as well. OK, and so, yeah,
it's getting better over time. OK, so
that's it for our network.
And let me stop this. And you can feed it anything, really. You can feed it any text file;
it's going to work with any text file, OK?
So we've computed the forward pass and the backward pass, and the backward pass is just the chain rule, OK?
I've got links to help you out in the description, but it's just the chain rule.
We're just continuously computing derivatives, or gradient values (partial derivatives or gradients, same thing).
We call them partial because they're with respect to each of the weights in the network,
going backward, and we move this error backward by computing the dot product of each layer's
matrix with the derivative of the previous layer, just continually.
And that's the chain rule.
And if we do this, we can generate words. We can generate any type of text
we want, given some text corpus. You can generate Wikipedia articles,
you can generate fake news, you can generate anything really, code, and
yeah. So also, for deep learning,
you might be asking: which of these layers do we add depth to?
Where do we add deeper layers? Do we add
more layers between the input and the hidden state, between the hidden state and the output, or between the hidden state and the hidden state?
In which direction do I add deeper and deeper layers? Well, the answer is that it depends.
This is one thing that's being worked on, but the idea is that you'll get different results for whatever
set of matrices you
add deeper layers to. There are different papers on this, but yes, adding deeper layers is going to give you better results, and that's deep learning:
recurrent nets applied to deep learning. But this is a simple three-layer recurrent network that works really well,
and I would very much encourage you to check out the GitHub link in the description
and the learning resources to learn more about this. So yeah, please subscribe for more programming videos, and for now
I've got to do a Fourier transform.
So thanks for watching.
