Dear Fellow Scholars, this is Two Minute Papers
with Károly Zsolnai-Fehér.
Earlier, we talked about Google's WaveNet,
a learning-based text-to-speech engine.
This means that we give it a piece of written
text and after a training step using someone's
voice, it has to read it aloud using this
person's voice as convincingly as possible.
And this follow-up work is about making it
even more convincing.
Before we go into it, let's marvel at these
new results together.
Hm-hm!
As you can hear, it is great at prosody, stress
and intonation, which leads to really believable
human speech.
The magic component in the original WaveNet
paper was introducing dilated convolutions
for this problem.
A dilated convolution takes large, regular skips
over the input data, so we get a better global view of it.
It is a bit like increasing the receptive
field of the eye so we can see the entire
landscape, and not only a tree on a photograph.
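
To make the receptive field idea a bit more concrete, here is a minimal sketch in plain Python. The kernel size of 2 and the doubling dilations (1, 2, 4, ..., 512) are illustrative assumptions, and the function names are made up for this example; this is not the authors' implementation.

```python
def receptive_field(kernel_size, dilations):
    """Number of input samples one output sample can see through a stack of layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def causal_dilated_conv(signal, weights, dilation):
    """y[t] = sum_k weights[k] * signal[t - k * dilation], treating t < 0 as silence."""
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation  # look back in steps of `dilation` samples
            if idx >= 0:
                acc += w * signal[idx]
        out.append(acc)
    return out


# Assumed setup: ten layers, kernel size 2, dilation doubling in every layer.
doubling = [2 ** i for i in range(10)]
print(receptive_field(2, doubling))   # 1024 samples
print(receptive_field(2, [1] * 10))   # 11 samples without dilation
```

With the same ten layers, the dilated stack sees roughly a hundred times more of the signal than the undilated one, which is the entire landscape versus the single tree.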
The magic component in this new work is using
Mel spectrograms as an input to WaveNet.
This is an intermediate representation that
is based on human perception and records
not only how different words should be pronounced,
but also the expected volumes and intonations.
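
As a rough illustration of what "based on human perception" means here, the sketch below converts frequencies to the Mel scale using one widely used formula, mel = 2595 * log10(1 + f / 700). The band count, the frequency range and the function names are assumptions for this example, not settings taken from the paper.

```python
import math


def hz_to_mel(f_hz):
    """Convert a frequency in Hz to Mels (a common variant of the formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_band_centers(f_min, f_max, n_bands):
    """Center frequencies (Hz) of bands spaced evenly on the Mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_bands)]


# Hypothetical settings: 10 bands between 0 Hz and 8000 Hz.
for center in mel_band_centers(0.0, 8000.0, 10):
    print(round(center))
```

The bands are packed densely at low frequencies and spread out at high ones, which mirrors how our hearing resolves pitch. A Mel spectrogram sums the energy of a regular spectrogram within such bands.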
The new model was trained on about 24 hours
of speech data.
And of course, no research work should come
without some sort of validation.
The first is recording the mean opinion scores
for previous algorithms, this one and real,
professional voice recordings.
The mean opinion score is a number that describes
how well a sound sample would pass as genuine human
speech.
The new algorithm passed with flying colors.
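
For the curious, a mean opinion score is typically just the average of listener ratings on a scale from 1 to 5. The sketch below computes it together with a rough 95% confidence interval using a normal approximation. The ratings are made-up placeholders, not data from the paper, and the function name is hypothetical.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mean, half_width) where half_width is a rough 95% interval."""
    mean = statistics.mean(ratings)
    # Normal approximation: 1.96 standard errors on either side of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Hypothetical listener ratings on a 1 to 5 scale.
ratings = [4, 5, 4, 5, 5, 4, 3, 5]
mos, half = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {half:.2f}")
```

Comparing two systems then comes down to checking whether their intervals overlap, which is why such studies need many listeners and many samples.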
An even more practical evaluation was also
done in the form of a user study where people
were listening to the synthesized samples
and professional voice narrators, and had
to guess which one is which.
And this is truly incredible, because most
of the time, people had no idea which was
which - if you don't believe it, we'll try
this ourselves in a moment.
A very small, but statistically significant
tendency towards favoring the real recordings
was observed, likely because some words, like
"merlot", are mispronounced.
Automatically voiced audiobooks, automatic
voice narration for video games.
Bring it on.
What a time to be alive!
Note that producing these waveforms is not
real time and still takes quite a while.
To progress along that direction, scientists
at DeepMind wrote a heck of a paper where
they sped WaveNet up a thousand times.
Leave a comment if you would like to hear
more about it in a future episode.
And of course, new inventions like this will
also raise new challenges down the line.
It may be that voice recordings will become
much easier to forge and be less useful as
evidence unless we find new measures to verify
their authenticity, for instance, to sign
them like we do with software.
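
Here is a minimal sketch of what such a verification step could look like, using only the Python standard library. Note the swap: software is normally signed with public-key signatures, while this example uses an HMAC with a shared secret as a simpler stand-in. The key, the audio bytes and the function names are all hypothetical.

```python
import hashlib
import hmac


def sign_recording(audio_bytes, secret_key):
    """Compute an authentication tag over the raw bytes of a recording."""
    return hmac.new(secret_key, audio_bytes, hashlib.sha256).hexdigest()


def verify_recording(audio_bytes, secret_key, tag):
    """Check a recording against its tag using a constant-time comparison."""
    expected = sign_recording(audio_bytes, secret_key)
    return hmac.compare_digest(expected, tag)


# Hypothetical key and audio data, for illustration only.
key = b"shared-secret-key"
original = b"raw audio samples of the real recording"
tag = sign_recording(original, key)

print(verify_recording(original, key, tag))                  # True
print(verify_recording(b"forged audio samples", key, tag))   # False
```

Any change to the audio, even a single byte, produces a different tag, so a forged recording fails the check unless the forger also holds the key.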
In closing, a few audio sample pairs. In each pair, one
of them is real, one of them is synthesized.
What do you think, which is which?
Leave a comment below.
I'll just leave a quick hint here that I found
on the webpage.
Hopp!
There you go.
If you have enjoyed this episode, please make
sure to support us on Patreon.
This is how we can keep the show running,
and you know the drill, one dollar is almost
nothing, but it keeps the papers coming.
Thanks for watching and for your generous
support, and I'll see you next time!
