Dear Fellow Scholars, this is Two Minute Papers
with Károly Zsolnai-Fehér.
Due to popular demand, here is the new DeepMind
paper on WaveNet.
WaveNet is a text to speech algorithm that
takes a sentence as an input, and gives us
audio footage of these words being uttered
by a person of our choice.
Let's listen to some results from the original
algorithm. Note that all of these are synthesized
by the AI.
All it requires is some training data of
this person's voice, typically 10 to 30 hours,
and a ton of computational power.
The computational power part is of special
interest, because we have to produce 16 to
24 thousand samples for each second of
continuous audio footage.
And unfortunately, as you can see here, these
new samples are generated one by one.
And since today's graphics cards are highly
parallel, this is wasteful: one compute unit
does all the work while the others sit there
twiddling their thumbs.
We need to make this more parallel somehow.
So, the solution is simple: instead of one,
we just make more samples in parallel!
No, no, no, it doesn't work like that, and
the reason is that speech is not like random
noise. It is highly coherent, where each new
sample depends heavily on the previous ones.
We can only create one new sample at a time.
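The sequential bottleneck described above can be sketched like this. The model here is a purely illustrative stand-in, not the actual WaveNet architecture; only the shape of the loop matters: each sample must wait for all the samples before it.

```python
import numpy as np

def generate_autoregressive(predict_next, n_samples, context_len=4):
    """Toy sketch of autoregressive audio generation: each new sample
    is computed from the previously generated ones, so the time loop
    cannot be parallelized across samples."""
    audio = np.zeros(n_samples)
    for t in range(n_samples):
        context = audio[max(0, t - context_len):t]  # only past samples
        audio[t] = predict_next(context)            # one sample at a time
    return audio

# A stand-in "network": averages the recent context (the real WaveNet is
# a deep dilated convolutional model; this function is hypothetical).
toy_model = lambda ctx: 0.5 * ctx.mean() + 0.1 if ctx.size else 0.1

# One "second" of audio at a 16 kHz sampling rate means 16,000 loop steps.
waveform = generate_autoregressive(toy_model, n_samples=16000)
```

Even on a massively parallel GPU, this loop runs one step at a time, which is why generation was so slow.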
So how can we create the new waveform in one
go, using all these compute units in parallel?
This new WaveNet variant starts out from white
noise and applies changes to it over time
to morph it into the output speech waveform.
The changes take place in parallel over the
entirety of the signal, so that's a good sign.
It works by creating a reference network that
is slow, but correct.
Let's call this the teacher network.
And the new algorithm arises as a student
network, which tries to mimic what the teacher
does, but the student tries to be more efficient
at that.
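The noise-to-waveform idea can be sketched as follows. The "layers" are simple vectorized functions standing in for the real student network (which is a learned neural model); the point is that every transformation touches the whole signal at once, with no per-sample loop.

```python
import numpy as np

def student_generate(transform_layers, n_samples, seed=0):
    """Toy sketch of the parallel student: start from white noise and
    apply transformations to the entire signal at once, morphing it
    toward an output waveform."""
    rng = np.random.default_rng(seed)
    signal = rng.standard_normal(n_samples)  # white noise input
    for layer in transform_layers:
        signal = layer(signal)               # acts on all samples in parallel
    return signal

# Illustrative stand-in "layers", not the real network:
layers = [
    lambda s: np.convolve(s, np.ones(5) / 5, mode="same"),  # local smoothing
    lambda s: np.tanh(s),                                   # bound amplitude
]
waveform = student_generate(layers, n_samples=16000)
```

The few passes over the full signal replace thousands of sequential steps, which is where the massive speedup comes from.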
This has a similar vibe to Generative Adversarial
Networks, where we have two networks: one
actively tries to fool the other, while the
other tries to get better at distinguishing
fake inputs from real ones.
However, it is fundamentally different, because
the student does not try to fool the teacher,
but to mimic it while being more efficient.
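The spirit of this mimicry, in a simplified picture: the teacher assigns probabilities to what the student produces, and the student is penalized when its distribution diverges from the teacher's. This toy function on discrete distributions is only a sketch of that idea, not the paper's actual training objective.

```python
import numpy as np

def distillation_loss(student_logprobs, teacher_logprobs):
    """Simplified sketch of distillation: a KL-divergence-style penalty
    that is zero when the student's distribution matches the teacher's,
    and positive otherwise (toy discrete distributions)."""
    p = np.exp(student_logprobs)
    return float(np.sum(p * (student_logprobs - teacher_logprobs)))

# Identical distributions give zero loss; the student has nothing to fix.
logp = np.log(np.array([0.25, 0.25, 0.5]))
# distillation_loss(logp, logp) is 0.0
```

Contrast this with a GAN, where the generator's reward comes from the discriminator being wrong; here, the student's reward comes from agreeing with the teacher.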
And this yields a blisteringly fast version
of WaveNet that is over 1,000 times faster
than its predecessor.
It is not just real time, it is 20 times
faster than real time.
And you know what the best part is?
Usually, there are heavy tradeoffs for this,
but this time, the validation section of the
paper reveals that there is no perceived
difference between its outputs and those of
the original algorithm.
Hell yeah!
So, where can we try it?
Well, it is already deployed online in Google
Assistant in multiple English and Japanese
voices.
So, as you see, I was wrong.
I said that a few papers down the line, it
would definitely be done in real time.
Apparently, with this new work, it is not
a few more papers down the line, it is one,
and it is not a bit faster but a thousand
times faster.
Things are getting out of hand real quick,
and I mean this in the best possible way.
What a time to be alive!
This is one incredible and highly inspiring
work.
Make sure to have a look at the paper; it is
perfect training for the mind.
As always, it is available in the video description.
Thanks for watching and for your generous
support, and I'll see you next time!
