Hello World! It's Siraj
and let's make our own language translator using TensorFlow
Today there are about 6,800 different languages spoken across the world
and in an increasingly globalised world
nearly every culture has interactions with every other culture in some way
that means there are an incalculable number of translation requirements every second of every day across the world
Translating is no easy task
A language isn't just a collection of words and of rules of grammar and syntax
it's also a vast inter-connecting system of cultural references and connotations
and this reflects a centuries-old problem:
two cultures want to communicate but are blocked by a language barrier
our translation systems are fast improving though, so whether it be an idea or a story or a quest
each new advancement means one less message will be lost in translation.
During the Second World War, the British government was hard at work
trying to decrypt the Morse-coded radio messages that Nazi Germany
sent securely using a cipher machine known as Enigma.
They decided to hire a man named Alan Turing to help in their effort
and when the American government learnt of their translation effort
they were inspired to try it themselves, post war.
Specifically because they needed a way to keep up with Russian scientific publications.
The first public demo of a machine translation system, in 1954, translated from Russian to English using a vocabulary of 250 words.
It was dictionary based so it would attempt to match the source language to the target language
word for word
The results were poor since it didn't capture syntactic structure.
The second generation of systems used an interlingua
that means they changed a source language to a special intermediary language
with specific rules encoded into it
then from that generated the target language.
This proved to be more efficient, but this approach was soon overshadowed by the rise of statistical translation in the early 90s
Primarily from engineers at IBM
Innovation at IBM
Watch this
A popular approach was to break the source text down into segments
then compare them to an aligned bi-lingual corpus
using statistical evidence and probabilities to choose the most likely translation.
Nowadays the most used machine translation system in the world is Google Translate
and with good reason.
Google uses deep learning to translate from a given language to another with state of the art results
So how do they do this?
Let's recreate their results in TensorFlow to find out
The dataset we'll be using to train our language translation model
is a corpus of transcribed TED talks
It's got both the English version and the French version
and our goal will be to create a model that can translate from one to the other after training.
We'll be using the data_utils helper module from TensorFlow's translation example
to help us pre-process our data set and we'll start by defining our vocab size
which is the number of words we want to train on from our dataset.
We'll set it to 40k for each language, which is a small portion of the data
then we'll use data_utils to read the data from the data directory.
We give it our desired vocab size and it returns the formatted and tokenised words in both languages
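To make the vocab step concrete, here's a rough sketch of what capping a vocabulary and turning words into integer ids can look like. This is not the data_utils code itself. The names build_vocab and to_ids, the special tokens and the two toy sentences are all assumptions made up for illustration.

```python
from collections import Counter

# Reserved ids for padding, sentence start, sentence end and unknown words
# (an assumed convention, used here only for illustration).
SPECIAL_TOKENS = ["_PAD", "_GO", "_EOS", "_UNK"]
UNK_ID = SPECIAL_TOKENS.index("_UNK")

def build_vocab(sentences, vocab_size):
    """Map the most frequent words to integer ids, capped at vocab_size entries."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    keep = vocab_size - len(SPECIAL_TOKENS)
    most_common = [word for word, _ in counts.most_common(keep)]
    return {word: i for i, word in enumerate(SPECIAL_TOKENS + most_common)}

def to_ids(sentence, vocab):
    """Tokenise on whitespace and replace out-of-vocabulary words with _UNK."""
    return [vocab.get(word, UNK_ID) for word in sentence.lower().split()]

english = ["the talk was great", "the idea was new"]   # toy stand-in for the corpus
en_vocab = build_vocab(english, vocab_size=8)
ids = to_ids("the talk was strange", en_vocab)          # "strange" becomes _UNK
```

With a real corpus you'd build one vocabulary per language and run every sentence through to_ids before training.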
We'll then initialise TensorFlow placeholders for our encoder and decoder inputs
Both will be integer tensors that represent discrete values
they will be embedded into a dense representation later
We'll feed our source words to the encoder, and the encoded representation it learns to the decoder.
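Here's a small sketch of what "embedded into a dense representation" means for those integer inputs. It's plain Python rather than TensorFlow, and the table size, the random initial values and the embed helper are assumptions for illustration. In a real model the table rows are learnt during training.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 10, 4   # toy sizes; the video uses a 40k vocabulary

# One row of floats per word id. Randomly initialised here, trained in practice.
embedding_table = [[random.uniform(-0.1, 0.1) for _ in range(embedding_dim)]
                   for _ in range(vocab_size)]

def embed(token_ids):
    """Replace each integer id with its dense vector (a plain table lookup)."""
    return [embedding_table[i] for i in token_ids]

encoder_inputs = [4, 6, 5, 3]           # integer ids for one source sentence
dense_inputs = embed(encoder_inputs)    # one vector of 4 floats per word
```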
Now we can build our model
Google recently published a paper discussing a system they integrated into their translation service
called Neural Machine Translation.
It's an encoder decoder model inspired by similar work from other papers on topics like text summarisation.
So whereas before Google Translate would translate from language A to English to language B
with this new NMT architecture, it can translate directly from one language to the other
It doesn't memorise phrase to phrase translations
instead it encodes the semantics of the sentence.
This encoding is generalised so it can even translate between a language pair like Japanese and Korean
that it hasn't explicitly seen before.
So I guess we can use an LSTM recurrent network to encode a sentence in language A
the RNN spits out a hidden state 's' which represents the vectorised contents of the sentence.
We can then feed 's' to the decoder which will generate the translated sentence in language B, word by word.
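The encode-then-decode idea can be sketched in a few lines of plain Python. To keep it short this uses a simple tanh RNN step instead of an LSTM, shares one set of untrained random weights between encoder and decoder, and picks words greedily. All of those are simplifications I'm assuming for illustration, so the output ids are meaningless until a model like this is trained.

```python
import math
import random

random.seed(1)
HIDDEN, VOCAB = 3, 5   # toy sizes

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def one_hot(index):
    return [1.0 if j == index else 0.0 for j in range(VOCAB)]

W_in = rand_matrix(HIDDEN, VOCAB)     # input word -> hidden
W_rec = rand_matrix(HIDDEN, HIDDEN)   # previous hidden -> hidden
W_out = rand_matrix(VOCAB, HIDDEN)    # hidden -> score per output word

def step(state, token_id):
    """One plain tanh RNN step (a stand-in for the LSTM cell)."""
    pre = [a + b for a, b in zip(matvec(W_in, one_hot(token_id)), matvec(W_rec, state))]
    return [math.tanh(p) for p in pre]

def encode(source_ids):
    """Fold the whole source sentence into one fixed-size state 's'."""
    state = [0.0] * HIDDEN
    for token_id in source_ids:
        state = step(state, token_id)
    return state

def decode(s, max_len, go_id=1):
    """Generate output ids one at a time, starting from the encoder state."""
    state, token, output = s, go_id, []
    for _ in range(max_len):
        state = step(state, token)
        scores = matvec(W_out, state)
        token = max(range(VOCAB), key=lambda i: scores[i])   # greedy pick
        output.append(token)
    return output

s = encode([4, 2, 3])
translation = decode(s, max_len=4)
```

Notice that everything the decoder knows about the source sentence has to pass through the HIDDEN numbers in 's'.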
Sounds easy enough right?
WRONG!
There is a drawback to this architecture: it has limited memory
that hidden state 's' of the LSTM is where we're trying to cram the whole sentence we want to translate
's' is usually only a few hundred floating point numbers long
The more we try to force our sentence into this fixed dimensionality vector
the more lossy our neural net is forced to be.
We could increase the hidden size of the LSTM. After all, they're supposed to remember long term dependencies
but what happens is as we increase the hidden size 'h' of the LSTM
the number of weights grows roughly with the square of 'h', so the training time climbs steeply.
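As a rough check on how the cost scales with 'h', we can count the weights in a single standard LSTM layer: four gates, each with an input matrix, a recurrent matrix and a bias. The function name and the sizes below are assumptions for illustration, and weight count is only a proxy for training time.

```python
def lstm_param_count(input_size, hidden_size):
    """Weights and biases in one standard LSTM layer (four gates)."""
    per_gate = hidden_size * (input_size + hidden_size) + hidden_size
    return 4 * per_gate

# Doubling the hidden size multiplies the weight count by roughly three to four.
for h in (256, 512, 1024):
    print(h, lstm_param_count(input_size=256, hidden_size=h))
```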
So to solve this we're going to bring 'Attention' into the mix
If I was translating a long sentence, I'd probably glance back at the source sentence a couple times
to make sure I was capturing all the details.
I'd iteratively pay attention to the relevant parts of the source sentence
We can let neural nets do the same by letting them store and refer to previous outputs of the LSTM
This increases the storage of our model without changing the functionality of the LSTM
So the idea is once we have LSTM outputs from the encoder stored
we can query each output asking how relevant it is to the computation happening in the decoder
Each encoder output gets a relevancy score which we can convert to a probability score
by applying a softmax activation to it.
Then we extract a context vector which is a weighted summation of the encoder outputs
depending on how relevant they are.
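Those three steps (score, softmax, weighted sum) fit in a short sketch. I'm assuming a dot product between the decoder state and each encoder output as the relevancy score, purely to keep it simple. Real systems usually learn the scoring function, and the toy vectors below are made up.

```python
import math

def softmax(scores):
    """Turn raw relevancy scores into probabilities that sum to 1."""
    highest = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_outputs):
    """Score each stored encoder output, then build the context vector."""
    scores = [sum(d * e for d, e in zip(decoder_state, output))
              for output in encoder_outputs]    # dot-product relevancy (assumed)
    weights = softmax(scores)
    dim = len(encoder_outputs[0])
    context = [sum(w * output[k] for w, output in zip(weights, encoder_outputs))
               for k in range(dim)]             # weighted sum of encoder outputs
    return weights, context

encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # one vector per source word
decoder_state = [2.0, 0.0]
weights, context = attend(decoder_state, encoder_outputs)
```

The decoder would recompute weights and context at every output step, so it can look at a different part of the source sentence for each word it generates.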
Memory ain't enough, pay attention
Memory ain't enough, pay attention (In Hindi)
Memory ain't enough, pay attention (In German)
Memory ain't enough, pay attention (In Spanish)
Memory ain't enough, pay attention
We build our model using TensorFlow's built in embedding_attention_seq2seq function
giving it our encoder and decoder inputs as well as a few hyper parameters we define like
the number of layers.
It builds a model that is just like the one we discussed
TensorFlow has several built in models like this that we can drop into our code easily
So normally this alone would be fine and we could run this and the results would be decent
but they added another improvement to their model that requires
MORE CODE
100 GPUs
and a WEEK OF TRAINING
Seriously that's what it took
we won't implement it all programmatically but let's dive into it conceptually
If the outputs don't have sufficient context then they won't be able to give a good answer
we need to include info about future words, so that the encoder output is determined by the words on the left and the right.
We humans would definitely use this kind of full context to determine the meaning of a word we see in a sentence.
The way they did this is to use a bi-directional encoder
so it's two RNNs.
One that goes forward over the sentence
and the other goes backwards.
So for each word it concatenates the vector outputs
which produces a vector with context from both sides.
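Here's a toy version of that forward-plus-backward pass. To keep it readable each word is a single number and the "RNN" is a one-line scalar recurrence, both of which are assumptions for illustration rather than what the real model does.

```python
import math

def run_rnn(inputs, weight=0.5):
    """Toy scalar RNN: each output mixes the current input with the running state."""
    state, outputs = 0.0, []
    for x in inputs:
        state = math.tanh(x + weight * state)
        outputs.append(state)
    return outputs

def bidirectional(inputs):
    """Run one pass left to right and one right to left, then pair them per word."""
    forward = run_rnn(inputs)
    backward = run_rnn(inputs[::-1])[::-1]    # flip back to the original word order
    return [[f, b] for f, b in zip(forward, backward)]

sentence = [0.1, 0.9, -0.4]          # one number per word, standing in for embeddings
outputs = bidirectional(sentence)    # each entry: [left context, right context]
```

The first word's forward value has seen nothing but itself, while its backward value has already seen the whole rest of the sentence.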
and they added a lot of layers to their model.
The encoder has one bi-directional RNN layer and seven uni-directional RNN layers
The decoder has eight uni-directional RNN layers
The more layers the longer the training times
so that's why they use a single bi-directional layer
if all the layers were bi-directional
each whole layer would have to finish before the layers that depend on it could start computing
But by using uni-directional layers, computation is going to be more parallel.
We'll initialise our TensorFlow session, then our model inside of it
Let's see some results after training.
First I'll give it this phrase
Looks good
and now another phrase
DOPE!
While it's not perfect and we still have a way to go
we're definitely getting closer to having a universal translation model.
Breaking it down
Encoder-Decoder architectures offer state-of-the-art performance in machine translation
by storing the previous outputs of the LSTM cells
we can judge the relevancy of each to decide which to use
via an attention mechanism.
And by using a bi-directional RNN, the context of both past and future words is used to create an accurate encoder output vector.
The coding challenge winner from last week is Ryan Lee
This was very impressive, he created a recipe summariser by scraping 125,000 recipes from the web
and documented it all beautifully with installation steps so you can reproduce the results yourself.
WIZARD OF THE WEEK!
and the runner up is Sarah Collins
her code converts scientific papers to text and prioritises them by topic.
This week's coding challenge is to create a simple translation system using an encoder-decoder model.
All the details are in the readme, post your github link in the comments and
I'll announce the winner next week.
Please subscribe for more programming videos
Check out this related video
and for now I've got to get a better GPU
So, thanks for watching!
