Hey everyone, welcome to The Semicolon. In this tutorial we'll be learning about recurrent neural networks, which are a pretty interesting neural network architecture. We'll start with the simpler implementations of RNNs and proceed to the more complicated ones, like the LSTM, or long short-term memory. RNNs are interesting because the results they produce in so many applications are just amazing. For example, when we train RNNs on sentences, they generate similar sentences. They've shown amazing results with chatbots, image captioning, and machine translation, and that is just the tip of the iceberg. Before we start, I'd like you to know that this channel has lots of tutorials on machine learning and data analytics, so you might want to have a look at them. Let's get started.
Let's first discuss what the problem is with our traditional feed-forward neural networks. Feed-forward neural networks are simply not good enough for these tasks: they need a fixed-size input and they give us a fixed-size output. They do not capture sequences or time-series information, nor do they account for memory. This makes them very unsuitable for a lot of tasks that involve time-series data. Recurrent neural networks, on the other hand, capture information about the sequence, or the time-series data. They can take variable-size inputs, give us variable-size outputs, and they work really well with time-series data. So let's look at how they work.
Now, understanding RNNs is kind of tricky, and a lot of representations lead to misconceptions, so we'll start with the basic formula of an RNN and then visualize it. Recurrent neural networks work on a recursive formula: the new state of the network at time t is a function of its old state, that is, its state at time t-1, and the input at time t, which is x_t. In other words, s_t = f(s_(t-1), x_t). This function is the basic idea behind RNNs.
Let's look at the simplest implementation of an RNN, which we call the simple RNN. It works with the formula we just discussed, where the recursive function is a tanh function: we multiply the input with the weights of x, Wx, and the previous state with Ws, and then pass the sum through a tanh activation to get the new state, s_t = tanh(Wx·x_t + Ws·s_(t-1)). Wx and Ws are the weights; in many places Ws and Wx are represented by a single W. To get the output vector we multiply the new state, that is s_t, with Wy: y_t = Wy·s_t.
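To make this concrete, here's a minimal NumPy sketch of a single simple-RNN step. The sizes and names (input_size, hidden_size, rnn_step) are my own assumptions for illustration, not code from this tutorial.

```python
import numpy as np

# Hypothetical sizes, just for illustration
input_size, hidden_size, output_size = 3, 4, 2

rng = np.random.default_rng(0)
Wx = rng.normal(size=(hidden_size, input_size))   # input-to-state weights
Ws = rng.normal(size=(hidden_size, hidden_size))  # state-to-state weights
Wy = rng.normal(size=(output_size, hidden_size))  # state-to-output weights

def rnn_step(s_prev, x):
    """One simple-RNN step: s_t = tanh(Wx @ x_t + Ws @ s_(t-1))."""
    s_new = np.tanh(Wx @ x + Ws @ s_prev)
    y = Wy @ s_new    # output: y_t = Wy @ s_t
    return s_new, y
```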
It looks like the diagram: we take the input and the old state, calculate the new state, and using the new state we calculate the output. Now this might be a little confusing, so to clarify, let's unroll the RNN and see how it works. We have a previous state, s0, and the input at time step 1, x1. We feed these into our RNN, which calculates the new state based on the recursive formula and gives us state one, s1. To get the output we multiply s1 with Wy. Then the new state s1 and the input x2 are the inputs for the next time step; we get s2, and we again get the output by multiplying it with Wy. The same thing goes on for as many time steps as we have. Now you have to note that we use the same set of weights throughout: Wx, Wy, and Ws remain the same throughout the network.
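Unrolling is really just a loop that reuses those same weights at every step. A quick sketch, reusing the hypothetical rnn_step from above:

```python
# Unroll the RNN over a (made-up) 5-step input sequence.
# The same Wx, Ws, and Wy are reused at every time step.
xs = [rng.normal(size=input_size) for _ in range(5)]  # x1 .. x5
s = np.zeros(hidden_size)                             # initial state s0
outputs = []
for x in xs:
    s, y = rnn_step(s, x)   # produces s1, s2, ... with shared weights
    outputs.append(y)
```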
In the case of multi-layer RNNs, the output we calculate serves as input to the next layer, and this is what lets us create multi-layer RNNs: here y1 and y2 act as the inputs to the new layer. In general, deeper networks give better accuracy, but with RNNs we do not go a lot deeper; people generally use models that are three to four layers deep.
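One way to picture the stacking, as a sketch under my own assumptions (each layer owns its own weights, and each layer's new state feeds the layer above, which is one common convention):

```python
# Sketch of a 2-layer RNN: layer 1's state at each time step becomes
# the input of layer 2. Every layer has its own weight matrices.
def make_layer(in_size, hid_size):
    return {"Wx": rng.normal(size=(hid_size, in_size)),
            "Ws": rng.normal(size=(hid_size, hid_size))}

layers = [make_layer(input_size, hidden_size),
          make_layer(hidden_size, hidden_size)]

def deep_step(states, x):
    new_states, inp = [], x
    for layer, s_prev in zip(layers, states):
        s = np.tanh(layer["Wx"] @ inp + layer["Ws"] @ s_prev)
        new_states.append(s)
        inp = s               # this layer's state is the next layer's input
    return new_states
```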
Now, neural networks learn through backpropagation, and recurrent neural networks learn using backpropagation through time. We calculate the loss using the output and go back through each state, updating the weights by multiplying the gradients. Say each state has a gradient of 0.01, or 10 to the power of minus 2, and we have 100 states. We would go back through each state and update the weights; to update the first state, the gradient would be 0.01 to the power 100, that is (10^-2)^100 = 10^-200, which is practically zero. The weight updates would be negligible, so the neural network wouldn't learn at all. This would be a pretty bad situation, because we would be running our machine but our network wouldn't get any better. This problem is what we call the vanishing gradient problem.
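You can check that arithmetic directly; a tiny sketch of the numbers used in the example:

```python
# Multiplying 100 per-step gradients of 0.01 each:
grad = 0.01 ** 100     # (10^-2)^100 = 10^-200
print(grad)            # 1e-200 -- far too small to move the weights

# With a few more steps the product underflows double precision entirely:
print(0.01 ** 200)     # 0.0
```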
So, to solve this problem and improve the accuracy, we add a few more interactions to the RNN, and this is the idea behind the LSTM. The LSTM solves the vanishing gradient problem and gives us much better accuracy than simple RNNs, so let's look at how the LSTM works and then visualize it in a similar way. An LSTM is made up of three gates and one cell state; these gates and the cell state are the additional interactions. We have the forget gate, which takes the old state and the input, multiplies them with their respective weights, and passes the result through a sigmoid activation. We have the input gate and the output gate, and we do the same thing with them; one thing to note, though, is that each gate has a different set of weights. Here C' can be called an intermediate cell state. Then we calculate Ct, the cell state, using this function: the input gate multiplied with the intermediate cell state is added to the old cell state multiplied with the forget gate, that is, Ct = f_t * C_(t-1) + i_t * C'_t. Then we pass the cell state through a tanh activation and multiply it with the output gate to get the new state. Note that all these multiplications are element-wise multiplications.
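Putting those gate equations together, here's a minimal NumPy sketch of one LSTM step. The weight names mirror the Wf, Wi, Wo, and Wc used here, but the shapes are my assumptions, and each matrix acts on the concatenation of the previous state and the input (the "clubbed together" weights discussed below):

```python
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical gate weights; each acts on the concatenation [s_prev, x].
concat_size = hidden_size + input_size
Wf = rng.normal(size=(hidden_size, concat_size))  # forget gate
Wi = rng.normal(size=(hidden_size, concat_size))  # input gate
Wo = rng.normal(size=(hidden_size, concat_size))  # output gate
Wc = rng.normal(size=(hidden_size, concat_size))  # intermediate cell state

def lstm_step(s_prev, c_prev, x):
    z = np.concatenate([s_prev, x])
    f = sigmoid(Wf @ z)             # forget gate
    i = sigmoid(Wi @ z)             # input gate
    o = sigmoid(Wo @ z)             # output gate
    c_bar = np.tanh(Wc @ z)         # intermediate cell state C'
    c_new = f * c_prev + i * c_bar  # element-wise: Ct = f*C(t-1) + i*C'
    s_new = o * np.tanh(c_new)      # new state: st = o * tanh(Ct)
    return s_new, c_new
```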
This might again be tough to visualize, so to make things a little clearer, let's try to form a network with it. We have our old state, s0, and the input x1; we also get the previous cell state, c0. So s0, c0, and x1 are the inputs. When I say Wf, Wi, Wo, or Wc, I mean separate weights for x and s; that is, the inputs and the states have separate weight matrices, but we have clubbed them together to reduce the complexity of the visualization. So we proceed forward. First we calculate the input gate by passing the previous state and the input through a sigmoid activation. Then we calculate the intermediate cell state by passing our input and the previous state through a tanh activation, and we perform an element-wise multiplication of the two. We similarly calculate the forget gate and multiply it with the old cell state, c0. Then we add these to obtain the new cell state, c1. We calculate the output gate and multiply it with the cell state passed through a tanh activation, and this gives us the new state, s1. The new cell state and the state are passed over to the next time step so that it can use them for further calculations.
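As a loop over time steps, again reusing the hypothetical lstm_step sketched above:

```python
# Run the LSTM over a sequence: both s and c flow between time steps.
s = np.zeros(hidden_size)       # s0
c = np.zeros(hidden_size)       # c0
for x in xs:                    # xs as in the earlier RNN sketch
    s, c = lstm_step(s, c, x)   # yields s1/c1, s2/c2, ...
```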
So this is the LSTM: it solves the vanishing gradient problem in this way and works better than the simple RNN in terms of accuracy. That is it, thank you guys! In the next tutorial we'll be implementing the simple RNN and the LSTM, so stay tuned. Thank you.
