[MUSIC PLAYING]
LUIS CARLOS COBO:
Good afternoon.
My name is Luis Carlos Cobo.
I lead the WaveNet team.
I work for Google in Mountain View.
And I oversaw and
worked on launching
WaveNet for the Google Assistant
first in October last year.
And that was also
launched to cloud later.
And I'm going to talk a little bit about what WaveNet is, how it works, why it is better
than previous approaches
for text-to-speech,
and a little bit
of how it compares
with the other methods.
So before we had WaveNet,
the previous approaches
to text-to-speech basically fall into two categories.
Usually the highest quality
one was concatenative.
And in this type of
text-to-speech approaches,
we take a large corpus of speech from one speaker and we divide it into small slices of a few milliseconds.
And then text-to-speech becomes just a problem of finding the right pieces of speech to reconstruct what you want to say.
On the other hand, with
parametric approaches,
we have a linguist build a model
or a mathematical function that
approximates how the
human vocal tract works.
And then we learn,
sometimes maybe using
machine learning, how
to drive that model
to be able to say
the things that we
want to say and to resemble the
speaker that we want to match.
So both approaches
have come a long way.
They are good for
many applications.
But they have some limits.
For concatenative
methods, the main issue
is the amount of data that
you need for it to sound good.
You need several
dozen hours of data
so that you have enough
pieces to reconstruct anything
that you might want to say.
If you don't have enough speech, you will have to use suboptimal slices in some situations, and that will sound bad.
There will be noticeable
glitches in the audio.
The other problem
is with scaling.
A concatenative approach doesn't allow you to generalize across speakers or to transfer between speakers.
If you have a corpus of
speech for one speaker
and you want to add another
one, you need as much data
as you needed for the first.
The same applies for
expressive speech.
If you want your speaker to
be able to produce speech
in a range of
emotions, you almost
need to duplicate the whole data set for every emotion that you want it to express.
On the other hand, it
sounds pretty natural,
because you are
using real speech.
But it has these limitations.
With parametric,
usually we can get away
with less data,
because we already
have a model that is an approximation of how the human vocal tract sounds.
The problem here is usually that
the speech that is produced--
the quality of that speech
is limited by the quality
of the model that you have.
And these are never perfect.
So the speech ends up sounding
somewhat noisy, robotic,
and unnatural.
Here's where WaveNet came in.
WaveNet can be seen as a
parametric model on steroids
where everything, including the human vocal tract, is learned in an implicit neural network representation from the data that you have.
Therefore, the
quality of your speech
is no longer limited by
the model that you have,
but by the amount of
data that you have,
the quality of that data,
and the computing power you have to process that data.
Also, this allows
us, first, to produce speech of even higher quality than unit selection for the same amount of data.
And it also allows
you to do transfer
learning across expressive
styles and across speakers.
So if you have a large corpus
for a language for one speaker,
you don't need as much data
to introduce a second speaker,
because the model can learn
a lot about generalities
of how humans speak
from the first data set
and you just need a little bit of data for the second to fine-tune and learn the peculiarities of that speaker.
So this allows us
to build more voices
and provide more speakers
and more emotional
expressive styles
for cloud customers.
So demo time, we'll
see if this works.
So we can hear
some real examples
from Google production systems.
So we're going to hear the unit selection sample first.
SPEAKER 1: A single
WaveNet can capture
the characteristics of
many different speakers
with equal fidelity.
LUIS CARLOS COBO: So as you
can hear, it's pretty natural.
Maybe it's a bit hard to tell without headphones, but there are a few glitches to work on.
SPEAKER 1: A single
WaveNet can capture
the characteristics of
many different speakers
with equal fidelity.
LUIS CARLOS COBO: Compare that with WaveNet.
SPEAKER 2: A single
WaveNet can capture
the characteristics of
many different speakers
with equal fidelity.
Now it's fast.
LUIS CARLOS COBO: So we don't
have those glitches there.
It may be a bit difficult
to tell the difference
without headphones.
We spent a lot of time refining these models with headphones on.
But there is a real improvement in quality that you can measure in ways that we will see later.
With respect to parametric, the difference is much clearer.
I think it will be evident even through the speakers.
So this is one
example of parametric.
SPEAKER 3: A single
WaveNet can capture
the characteristics of
many different speakers
with equal fidelity.
LUIS CARLOS COBO: And WaveNet.
SPEAKER 4: A single
WaveNet can capture
the characteristics of
many different speakers
with equal fidelity.
LUIS CARLOS COBO:
So these were--
SPEAKER 4: Now it's fast.
LUIS CARLOS COBO: So these
were two models trained
on the same speaker on
the same amount of data.
And you can see that the WaveNet
one sounds much more natural.
Here's another language, Japanese.
SPEAKER 5: [SPEAKING JAPANESE]
SPEAKER 6: [SPEAKING JAPANESE]
LUIS CARLOS COBO: All right.
Now you know how to say
WaveNet in Japanese.
Let me see.
OK.
So what is WaveNet?
WaveNet is a neural network
adapted for speech.
The main thing that
you have to know
is that before
WaveNet came around,
it was thought that it was
not possible to generate
speech audio sample
by audio sample
directly with a neural
network, because the time dependencies in the audio, when you have a high sample rate like 24,000 samples per second, are just too long for the model that you would have to train.
So how did we do it?
So what you need to
know is that WaveNet
is an autoregressive
neural network
with dilated convolutions.
What does this mean?
Autoregressive means that
every sample that you produce
goes back as an
input to the system
to produce the next sample.
That way you create continuous
sound that sounds natural.
Convolutional means that, instead of connecting all neurons from one layer to all neurons in the next, you just connect each neuron to a small number of neurons from the previous layer, and you reuse the weights of that connection in all neurons that have the same relationship.
That makes the model small enough for us to be able to learn it efficiently.
And finally, the secret
sauce is the dilated part.
What we do is that, as we go up through the layers of the neural network, the neurons that connect to a given neuron in the following layer are spaced farther and farther apart, exponentially.
What this gives us is that the time range that can influence a sample grows exponentially with the number of layers instead of linearly.
So, keeping a small model with a moderate number of layers, we can get a very large receptive field that allows us to model the long-term relationships that we need for the speech to sound natural.
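To make that receptive-field arithmetic concrete, here is a minimal sketch of a stack of dilated causal convolutions in NumPy. It is not the production model: the layer count, kernel size, and dilation pattern are illustrative assumptions.

```python
import numpy as np

def dilated_causal_conv(x, weights, dilation):
    """1D causal convolution with kernel size 2 and a given dilation.

    Each output sample t depends only on x[t] and x[t - dilation],
    so no future samples leak into the prediction."""
    pad = np.concatenate([np.zeros(dilation), x])
    return weights[0] * pad[:-dilation] + weights[1] * pad[dilation:]

# Illustrative stack: dilations double at every layer (1, 2, 4, ..., 512).
dilations = [2 ** i for i in range(10)]
x = np.random.randn(24000)  # one second of audio at 24 kHz
for d in dilations:
    x = np.tanh(dilated_causal_conv(x, weights=np.array([0.5, 0.5]), dilation=d))

# The receptive field grows exponentially with depth: a kernel of size 2
# adds `dilation` extra samples of context at every layer.
receptive_field = 1 + sum(dilations)
print(receptive_field)  # 1024 samples of context from only 10 layers
```

With the same number of layers and undilated convolutions, the context would grow only linearly, to about 11 samples instead of 1,024.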
So the paper describing this work came out in October 2016.
It made a splash because it sounded much more natural than previous systems.
But it was terribly
slow, mostly due
to the autoregressive
nature of the model.
You have to run a fairly
complex neural network just
to get one sample of audio.
And you need to do that
24,000 times to generate
one second of audio.
And there are a lot of optimizations that you can do.
But it is not feasible
with current hardware
to get it fast enough for a real
time text-to-speech system that
usually needs to run
orders of magnitude
faster than real time.
So what did we do?
So this motivated a
new line of research
that resulted in a new paper, published roughly a year after the work was in production, called Parallel WaveNet.
So what we do is that once
we train the original WaveNet
model, we use that as a teacher
for a second neural network
that is much faster.
This neural network just takes in a vector of noise and transforms it to sound like the speech that we want to generate.
That generated
waveform is then passed
to the already trained
WaveNet model that
scores how likely this is to be speech from a human being, or actually from the particular speaker that we want to replicate.
At the beginning, this network
produces just random noise,
but little by
little it learns how
to produce audio that pleases the original WaveNet network; basically, it learns to imitate the original network.
The reason we go this roundabout way is that this new network is feedforward, which means that in one single pass you can generate the whole utterance.
You don't have to go
sample by sample anymore.
And it's not only faster,
but it can be parallelized.
So you can chop the utterance into many pieces, send them to different processors, different computers, and assemble them.
And that's what allows us
to get the latency that we
need to make this available
in production systems.
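To see why the feedforward student is so much faster, here is a toy sketch contrasting sample-by-sample generation with a single-pass transform of a noise vector. The `teacher_step` and `student_transform` functions are made-up stand-ins for illustration, not the real networks or the distillation loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher_step(history):
    """Toy stand-in for one autoregressive WaveNet step: it predicts
    the next sample from the samples generated so far."""
    return 0.9 * history[-1] + 0.01 * rng.standard_normal()

def student_transform(noise):
    """Toy stand-in for the feedforward student: it maps a whole noise
    vector to a whole waveform in a single pass (here, one convolution)."""
    return np.tanh(np.convolve(noise, np.ones(8) / 8.0, mode="same"))

n = 24000  # one second of audio at 24 kHz

# Autoregressive generation: 24,000 sequential network evaluations.
samples = [0.0]
for _ in range(n - 1):
    samples.append(teacher_step(samples))

# Feedforward generation: a single pass over the whole noise vector,
# which can also be chopped into pieces and run on different processors.
waveform = student_transform(rng.standard_normal(n))
```

The sequential loop cannot be parallelized because each step waits on the previous sample; the single-pass transform can be split across machines and reassembled.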
So this allowed us actually to
increase the speed of WaveNet
by three orders of magnitude.
So we went from generating 20 milliseconds of audio in one second, so significantly slower than real time, to generating 20 seconds of audio in just one second, so 20 times real time.
And that was enough for us
to use this in production.
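The "three orders of magnitude" claim, worked out from the numbers in the talk:

```python
old_rate = 0.020  # seconds of audio per second of compute (0.02x real time)
new_rate = 20.0   # seconds of audio per second of compute (20x real time)
print(new_rate / old_rate)  # 1000.0, a 1,000x speedup, i.e. three orders of magnitude
```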
And even though we gave up
on the autoregressive nature,
and that can have an
effect on quality,
even this Parallel WaveNet still
closes 70% of the
perceived naturalness
gap between synthesized
and real speech.
Here you can see MOS scores, mean opinion scores.
Basically, we do blind testing, where we send audio samples to people and they score how natural they sound from one to five, five being the maximum.
Usually, you don't get a five, because people always think that maybe you're trying to fool them.
Actually, real speech usually gets around 4.5.
And we are able to close the gap between synthetic and real speech by 70%.
And we believe that [INAUDIBLE] are going to allow us to fully close that gap.
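As a worked example of what closing 70% of the gap means on the MOS scale, with illustrative placeholder numbers rather than the published scores:

```python
# Illustrative MOS values on the 1-to-5 naturalness scale.
mos_previous_best = 4.0      # hypothetical previous best synthetic voice
mos_parallel_wavenet = 4.35  # hypothetical Parallel WaveNet score
mos_real_speech = 4.5        # real speech, roughly 4.5 per the talk

gap_closed = (mos_parallel_wavenet - mos_previous_best) / (mos_real_speech - mos_previous_best)
print(f"{gap_closed:.0%}")  # 70%
```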
Thank you so much.
With this, I will
pass to Dan Aharon
who's a PM in Google Cloud TTS.
Thank you.
DAN AHARON: Thanks, Luis.
Hi.
I'm Dan.
I'm a product
manager in Cloud TTS
like Luis said and a couple
other products in Cloud AI.
So I want to tell
you a little bit more
about Cloud Text-to-Speech.
So first, this technology,
what is it good for?
There are three main use cases we see.
One is in call centers for
automated voice responses.
So a lot of IVRs today,
interactive voice response,
they need to prerecord all
the prompts so that they can
play them when people call in.
With TTS, they can now generate the prompts automatically and they don't need to prerecord them.
They were forced to prerecord because synthesized speech was not really that good up until now.
But now it suddenly
becomes good enough.
And the other benefit is you
have much more flexibility
in language.
So you can insert entities that change, instead of having one script that was recorded three months ago that you can't deviate from.
Second thing-- similarly in IoT,
if you want to talk to devices
and have them talk back,
it's a very useful thing
to have so that you can have
conversations with your users.
And last but not least, media.
A lot of written media can
now find a new form in audio.
And you're going to hear
from Deanna a little bit more
about that shortly.
So then Cloud Text-to-Speech was
introduced three or four months
ago.
It's part of our conversation
group in the building blocks
and part of our
Cloud AI portfolio.
So if you haven't
already, definitely
recommend you check out
some of the sessions
for other products.
There's a lot of really
cool products in Cloud AI.
So Cloud TTS, as I mentioned,
was introduced in late March.
And it gives everyone the
power to use the same TTS
that Google does.
And that includes
using WaveNet voices.
We fortunately have the
ability to run stuff
on TPUs and other things.
And we can produce a machine
learning based speech synthesis
API at scale.
It's really easy to use.
You're going to see
that in a little bit.
And it's pretty flexible.
You can use text, or SSML, or
do all of these other things.
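For reference, here is a minimal sketch of calling the API from Python with plain text; SSML works the same way through the ssml field of the input. Exact module and field names depend on your version of the google-cloud-texttospeech client library, so treat this as illustrative.

```python
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

# Plain text input; SSML would use SynthesisInput(ssml="<speak>...</speak>").
synthesis_input = texttospeech.SynthesisInput(text="Hello from Cloud Text-to-Speech.")

# Pick a WaveNet voice; "en-US-Wavenet-D" is one of the US English voices.
voice = texttospeech.VoiceSelectionParams(
    language_code="en-US", name="en-US-Wavenet-D")

audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.LINEAR16)

response = client.synthesize_speech(
    input=synthesis_input, voice=voice, audio_config=audio_config)

with open("output.wav", "wb") as out:
    out.write(response.audio_content)
```

Swapping the voice name between a standard voice and a WaveNet voice is all it takes to hear the difference demonstrated later in this session.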
So a few new things
we have for you today.
First, WaveNet, up
until today, has only
been available in English US.
We now have seven more
languages that are available.
So that's pretty big.
It's been one of our
biggest requests from users.
So we're very
excited about that.
So it's now available in
English, German, French,
Japanese, Dutch, and Italian.
French is not live yet.
I think it will be live maybe
next week or pretty soon.
Second thing is audio profiles.
And we're going to talk a
little bit more about that.
So I'll come back to it.
So this is now our full
portfolio of voices.
14 total languages and variants.
And there are 30 standard voices and 23 WaveNet voices.
So across them you get
reasonably global coverage
with a few pockets
that are missing
that we're working on covering.
OK.
So the second thing
we're introducing today
is audio profiles.
So up until now, text-to-speech
produces a single wave file
and then, as a
developer, you can
use that wave file
to play it anywhere
you want to, whether
that's in a tiny speaker,
whether that's on headphones,
or whether that's on a phone
line or anywhere else.
What we found is
that the quality
of the speaker or
the attributes of it
can have a pretty big impact on
the quality of the sound that
comes out.
And so if you aspire to get the best quality,
you should actually have a
different waveform that's
sent to each type of speaker.
So starting from
today, you can actually
provide this audio
profile choice.
You can tell us
whether it's going
to be played on a handset, or
on a home entertainment device,
or on a phone line,
and then we'll
do the proper adjustments.
So here, for example, is an example wave file and what it looks like on a phone line.
So you can see all of this bass area there on the left, and all of this treble area on the right.
You don't actually hear
them on a phone line.
So when you try and
play this wave file,
it's going to sound
distorted because you're
missing a lot of
information that
doesn't get carried across.
And so what we're doing
with audio profiles
is we're compressing
it from the sides
into the middle, for this
example, for a phone line.
So you can see now the
waveform looks like this.
And you get much
more information
there in the middle,
which sounds better.
And so if I were to
play it on my laptop,
it'll actually sound worse.
But that same wave file when
you play it on the phone,
it sounds better.
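Here is a sketch of how you would request an audio profile through the API. The profile ID shown (for example "telephony-class-application") comes from the Cloud TTS documentation, but double-check the current list of profiles and the field names for your client-library version.

```python
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.LINEAR16,
    # Optimize the waveform for playback over a phone line;
    # other profiles target handsets, headphones, small speakers, etc.
    effects_profile_id=["telephony-class-application"],
)

response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(text="Your call is important to us."),
    voice=texttospeech.VoiceSelectionParams(
        language_code="en-US", name="en-US-Wavenet-D"),
    audio_config=audio_config,
)

with open("phone_line.wav", "wb") as out:
    out.write(response.audio_content)
```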
OK.
So with that, let me go to
the code lab and the demo.
Let's start from the demo first.
So Text-to-Speech-- I'm just
going to go to the news.
Let me make my screen bigger.
What [INAUDIBLE].
[LAUGHS] Let's use this article.
And let's just copy this text
and paste it into our new Cloud
Text-to-Speech API.
And I'm going to first play a regular voice so you get a sense for how it sounds.
SPEAKER 7: It's official.
Carmelo Anthony is now a
member of the Atlanta Hawks.
For now, the three
team trade sending
Anthony, Justin Anderson, and
a 2022 lottery protected--
DAN AHARON: So you can hear
it's a little robotic, right?
Now let's play the exact
same thing in WaveNet.
SPEAKER 8: It's official.
Carmelo Anthony is now a
member of the Atlanta Hawks.
For now, the three team
trade sending Anthony,
Justin Anderson,
and a 2022 lottery
protected first
round pick via OKC
to the Hawks, Dennis Schroeder,
and Timothy Luwawu-Cabarrot
to the Thunder, and Mike Muscala to the 76ers is official.
DAN AHARON: So you can see
it doesn't sound 100% human.
But if I weren't
telling you that this
is played by speech synthesis
and if you were just listening
to it, I would have imagined
this is an NPR report
or something like that--
SPEAKER 8: It's official--
DAN AHARON: --especially as
you get to the second paragraph
here.
SPEAKER 8: Carmelo
Anthony is now
a member of the Atlanta Hawks.
For now, the three team
trade sending Anthony,
Justin Anderson,
and a 2022 lottery
protected first
round pick via OKC
to the Hawks, Dennis Schroeder
and Timothy Luwawu-Cabarrot
to the Thunder, and Mike Muscala
to the 76ers is official.
It's official.
DAN AHARON: OK.
So let's go to the code lab.
[APPLAUSE]
Oh.
Thank you.
So what we're going
to see next is
I'm going to show you how
to take an audio file that I
have here and we're
going to transcribe it
with speech-to-text.
Then we're going to translate
it to a different language
and then we're going to turn
it into WaveNet and play it.
So there are a lot of things that can go wrong.
And I'm not a very
good Python developer,
so work with me here.
And let's try and
do it together.
Hopefully, we'll be
able to get through it.
OK.
So this is the simple
text-to-speech example
that's on our website.
And so let's leave it for now.
Let's come back to it.
Let's add some code
that does transcription.
So this is the speech-to-text
sample on our website.
I'm going to copy over
these import statements.
So we already have [INAUDIBLE], so we just need IO.
And then let's copy all of this.
OK.
And instead of client,
let's call it speech client.
OK.
And now let's give it a path,
slash users then documents
audio slash [INAUDIBLE].
Going to play this file.
SPEAKER 9: Hi.
I'd like to--
DAN AHARON: Oh, sorry.
Not this one actually.
I wanted this one.
SPEAKER 8: Welcome, everyone,
to the Google Cloud Next
Session for text-to-speech.
Hope you have a great day.
DAN AHARON: OK.
I'll put speaker
[INAUDIBLE] this one.
OK.
Dot wave.
And then we don't need
sample [INAUDIBLE] because it
auto detects it.
The language is EN-US.
Let's add punctuation.
Do we have punctuation here?
No.
It's probably in
the beta snippets.
Punctuation.
Yeah.
Here it is.
Enable punctuation equals true.
So I'm going to
paste it in here.
And let's also use
the video model.
I think you can do that with model.
Model equals video.
Yeah.
Equals video.
Are we using the beta
speech or the formal one?
Google Cloud-- OK.
We need this so
it gets the beta.
So let me make sure
that we're using that.
From Google Cloud,
import speech or speech.
OK.
Great.
So now we have client
recognized and then
it's printing the response.
So it says this.
Let's just run this and see
that it's working correctly.
And then here let's
make this linear 16
and call it output.wave.
Output.wave.
And then instead
of this text, we
can do alternative transcript.
OK.
We can do that later.
Let's try and run this now.
So Python-- let's go to the
text-to-speech directory first.
Python [INAUDIBLE]
part synthesize text.
Let's just give it a text.
OK.
So it has no model field.
It's probably not
using the beta.
We can probably do it
without the model, but--
Let's just see if there's-- oh.
Yeah.
It should be speech client.
Let's try it again.
If not, we can remove the model.
Let's just remove the model.
Maybe I have a typo
there or something.
OK, let's try this again.
Skip the punctuation, just
the stuff that's not in beta.
Line 48 here, something's
wrong with the audio.
AUDIENCE: It's the version of the [INAUDIBLE].
DAN AHARON: Version,
which version?
AUDIENCE: You want v1p1beta1.
DAN AHARON: Yeah it's
because I'm mixing.
I'm mixing the beta
with the non-beta.
So let's just use
the regular one.
We don't need the beta.
Let's try this now.
OK.
Welcome everyone to
the Google Cloud Next
session with text-to-speech.
Hope you have a great day.
So it doesn't have punctuation.
So that means the speech
we're going to produce
will not be as good.
But that's fine for now.
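A cleaned-up sketch of the speech-to-text step from this code lab, using the standard (non-beta) client; the file path is a placeholder and field names may differ slightly depending on your google-cloud-speech version.

```python
import io

from google.cloud import speech

speech_client = speech.SpeechClient()

# Hypothetical path to the WAV file played on stage.
with io.open("audio/welcome.wav", "rb") as audio_file:
    content = audio_file.read()

audio = speech.RecognitionAudio(content=content)
config = speech.RecognitionConfig(
    language_code="en-US",
    enable_automatic_punctuation=True,  # punctuation helps the TTS output later
)

response = speech_client.recognize(config=config, audio=audio)
transcript = response.results[0].alternatives[0].transcript
print(transcript)
```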
OK pick a language, guys.
What are you feeling today from
one of the ones with WaveNet
support?
German.
Let's do German.
So let's go to
translate Google Cloud.
Let's look at the code sample
here, Python, view on GitHub.
We'll add these imports
and then translate text.
It was a translate
text, here it is.
Translate client and result.
So we've done the recognition.
Now let's translate client.
And we don't need this stuff.
Let's translate it to German.
And the transcript equals--
we basically want response[0].alternatives[0].transcript.
Hopefully I got that right.
Let's just print it to be sure.
Transcript and transcript.
And then translate the
client, translate texts.
Instead of text, we'll
write transcript.
This is translation result.
And then let's input
that text here.
OK.
Oh sorry, and text-to-speech.
We should tell it
that it's now doing
German instead of English US.
OK let's try that.
Cannot import name.
I think I need to go and
set up the Cloud client.
Install-- is that the command?
OK, Let's try that again.
Let's try the Python
command again.
OK.
Response, where is it?
Oh, here it is.
OK so transcript=response0.
Oh no, it's response in
result. Yeah, exactly.
Thank you.
Response dot results zero.
Let's try this again now.
All the things that could go
wrong in a live coding session.
OK.
We got it.
Now there's just this
translated text thing.
So I think it's square
brackets, right?
Yeah, it's result translated text, so result['translatedText'].
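A cleaned-up sketch of the translation step, using the v2 Translation client as in the code lab; the variable names and the sample sentence are illustrative.

```python
from google.cloud import translate_v2 as translate

translate_client = translate.Client()

transcript = "Welcome, everyone, to the Google Cloud Next session for text-to-speech."

# Translate the recognized transcript into German.
result = translate_client.translate(transcript, target_language="de")
german_text = result["translatedText"]
print(german_text)
```

Feeding german_text back into the text-to-speech call, with the language code set to de-DE and a German WaveNet voice name, would produce the output wave file played at the end of the demo.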
Does anyone here speak
German by the way?
How can we test if
it's actually working?
Yes, no?
AUDIENCE: Yes.
DAN AHARON: Yes?
OK.
so this is the moment
of truth, guys.
So we are in speech
Cloud client and oh,
text-to-speech cloud client.
And we should have
this output Wave file.
Let's play it.
SPEAKER 10: [SPEAKING GERMAN]
DAN AHARON: Was that right?
AUDIENCE: Yes.
DAN AHARON: Great.
Thanks, everyone.
We did it together
with your help.
So if I can do it with my poor
Python programming skills,
that really is a
sign that anyone can.
So I recommend that you play around with it and see what you can do.
So with that, let
me welcome Deanna.
Thank you very much.
DEANNA STEELE: Thanks, everybody, for still being here.
I appreciate that.
I'm really excited to share with you some information about how we can use this amazing technology.
So my name's Deanna Steele.
I'm the CIO at a company
called Ingram Content Group.
Has anybody heard of Ingram?
Hands.
Well, yes.
I know some of you
in the front have.
Thank you.
For the rest of you: at Ingram Content Group, we connect books with readers.
What we are is the global content distributor for book-related content.
And that includes
physical books.
That includes e-books
and audio books.
It also includes
providing metadata
to our customers, who
tend to be retailers,
through all channels.
And it includes ingesting
publisher content, so publisher
metadata and so forth.
And because we have
this ecosystem that
relies on publishers and
retailers, what we do
is we provide analytics
back to publishers.
We deal with big publishing
houses, the big guys.
And we deal with small
independent publishers, as well
as independent authors.
Our customers are retailers.
They're direct consumers.
They're libraries.
And they're educators.
I'll go through this quickly.
Three key themes
we're seeing that
make this technology
really reasonable
and relatable right
now in the marketplace
and where we think there's
a huge opportunity.
So first of all, the business trends have led toward this technology really coming to fruition and really making a big difference.
The opportunities we see span
accessibility and other areas.
And then we'll talk a
little bit about how
we'll deploy the technology.
Sorry.
I'll move here.
Our innovation.
So today, actually in 2017, we
distributed 270 million audio
and e-books around the world.
We print a new book
every six seconds.
And we print on our
high speed HP printers
about 500 books an hour.
We span the globe--
if we were to look
at all the physical and digital
content that we've produced,
we'd span the globe 1.2 times.
The business trends.
So what's happening is barriers
to entry have fallen away.
So several years
ago, if you wanted
to print your book,
your own memoir,
or your own
educational book, you
would have had to
write the outline.
You would have had to shop
it to either publishers
or other agencies.
And you would have had
to do that numerous times
and suffer rejection,
hopefully not a lot.
But it happened.
And on average, getting a book accepted and published would have taken six to nine months.
Today, it can take just weeks.
We have worldwide distribution capabilities: we support 28 facilities, either offices or distribution centers, and have access to about 220 countries.
The business trends, so
direct reach and discovery.
We work with publishers
on strategy, so ideation.
We've dealt with
publishers and retailers
through channels
for a long time.
And that's our sweet spot.
So we help publishers, especially small and independent publishers and independent authors, find their way to direct distribution of their content.
We provide analytics.
We have advanced analytics platforms that allow data visualization and some degree of predictive analytics.
Different topic,
different time, but we're
getting into data
science in that area.
And we have hundreds
of publishers
that rely on those platforms
to understand how they manage
and market their business.
We provide discoverability.
So because we deal with
publisher metadata--
you're all familiar
with metadata--
we actually allow publishers
to ingest their content.
And we help make
recommendations to them
as to how to make
their books more
marketable and discoverable,
which is a big deal.
If you think about publishers
or even independent authors,
very often, they're
not really aware of how
to be successful
in that business.
They just know they want to
produce that best seller.
We can help them do that.
And then finally, the metadata
that we ingest we actually
sell.
So we've talked about monetizing
data in previous sessions.
We do monetize data.
And that's a very important
part of what we do.
OK so let's talk a little
bit about business trends.
181 million adults in the
US read a book per year.
Who here has read a physical
book in the last year?
OK, not surprising
for this audience.
And that kind of
reflects what we've seen.
So the United States population
includes about 326 million
people.
And of those, about
55% are adults.
And of those, 181 million read.
So it's interesting.
What we're seeing is that
books in any format-- about 74%
of the population reads books.
But what we heard and what we
thought over time would happen
is that e-book distribution
would eclipse physical books.
And we've seen that
that hasn't happened.
So what we have seen, though, is that listening to audio books has increased significantly.
And we're working with partners
to ingest even more audio book
content.
Why?
For a few reasons.
Business trends
including accessibility.
So for us, accessibility
is very important.
When we think about
how we provide access,
whether through e-books or print, those are, of course, two key methods.
But what we're seeing
with accessibility
is that the US census had about 9 million people identified in the United States alone who were either visually impaired or blind.
And it's a challenge
because only 39,000 Braille
books had been printed.
It's a very small
percent of the population
and not everybody has
access to Braille,
nor do they all have
access to voice readers.
What that means is the
percentage of the population
that, potentially,
we can provide access
to is significant.
Not only when we think about the visually impaired, but also the learning impaired--
so people potentially
with dyslexia who have a
hard time reading, kids who
need that information
translated to them possibly,
and also people for whom
English is a second language,
and they want to be
able to quickly put
the audio and visual together.
So we find that
accessibility is key for us.
I'm going to give you a
snippet of what we believe
to be really important
and a potential for us
to be able to make this
text-to-speech resonate.
There are a few things
to think about here.
First of all, obviously, the
success around text-to-speech--
as you've heard--
has to do with understandability
and the ability
to sound natural.
In the past, when we've
heard text-to-speech,
we've heard very
robotic attempts.
And Bell Labs and
MIT have been working
on this technology for decades.
But what Google
is doing is really
uncovering that
natural language sound.
And so we're very
excited about that.
Well, we believe that book
discoverability is critical.
And in fact, Gartner says that
about 30% of all search in 2020
is going to be done via voice.
So it's going to be screenless.
So that's really important as
we think about enhancing book
discoverability and so forth.
The demo we're going to give you here is a snippet of a book by Leif Enger.
It's a book called "Virgil Wander."
Grove Atlantic will be publishing it in October of this year.
Leif is a New York Times bestselling author.
And so we think this book
is going to do really well.
So imagine you're driving,
you hear an NPR segment
about this book.
And you think I'd love
to hear a segment.
Before I go ahead
and demo it for you,
I'd like to give you
a little background
and context on the book.
So the book is written
about a gentleman
living in the Midwest.
He owns a cinema, a very old-fashioned cinema, and he still plays reel-to-reel projection.
So you'll hear something about being unspooled.
His life is a little bit unspooled right now.
And what happens to him, and how he picks it back up, we'll hear a little bit about.
We'll play you a snippet of
the first part of the book.
Ingram Content, Ingram Content.
SPEAKER 11: All right.
Getting the test version
of Ingram Content.
Hello.
What would you like read to you?
DEANNA STEELE: "Virgil Wonder"
SPEAKER 12: Now I
think the picture was
unspooling all along, and
I just failed to notice.
The obvious really isn't so,
at least it wasn't to me,
a Midwestern male cruising
at medium altitude,
aspiring vaguely to
decency, contributing
to PBS, moderate in all things,
including romantic forays,
and doing unto others
more or less reciprocally.
If I were to pinpoint when
the world began reorganizing
itself, that is when my
seeing of it began to shift,
it would be the day
a stranger named Rune blew into our bad-luck town of Greenstone, Minnesota, like a spark from the boreal gloom.
It was also the day of my
release from St. Luke's
hospital down in Duluth.
So I was concussed and
more than a little adrift.
The previous week, I'd driven up the shore to a popular lookout
to photograph a distant storm
approaching over Lake Superior.
It was a beautiful storm,
self-contained as storms often
are, hunched far out
over the vast water
like a blob of blue ink.
But it stalled in
the middle distance.
And time just slipped away.
There's a picnic
table up there where
I've napped more than once.
What woke me this time was the
mischievous gale delivering
autumn's first snow.
I leaped behind the wheel
as it came down in armloads.
Highway 61 quickly
grew rutted and slick.
Maybe I was driving too fast.
U2 was on the radio, "Mysterious Ways," I seem to recall.
Apparently, my heart-broken
Pontiac breached a safety
barrier and made
a long, lovely--
some might say cinematic--
arc into the churning lake.
DEANNA STEELE: Thank you.
So we can go on to
purchase the book.
OK.
We'll go ahead and move forward.
Let me spend a little
bit of time telling you
about the technology.
The technology is a
combination of Google's WaveNet
and Ingram Content's
CoreSource application.
Our CoreSource system is our
e-book content distribution
system.
And we house over 18
million titles in it.
So there are two components, as you all know, around audio text-to-speech.
And the first part
is the ingestion.
The second part is
the distribution.
So the way this works is that
we ingest publishers' content.
We do that today.
Again, we do e-book
distribution.
We bring that book content
into Ingram's CoreSource
and then push it out to Google Cloud Storage.
The Cloud Functions-- so step
four includes three components.
It includes translating
it to a wave file
and then storing
it in Google Cloud.
This is for all new
and changed content.
If you think about
it, book content
doesn't change that
much, unless it's new
or unless new
editions come through.
The Cloud Functions then pull that new or changed content through Cloud SQL
into two of our technologies,
one, Aerio, which is
a tool that allows
for written previews of books.
Publishers house
it on their sites.
And it allows them to
complete purchases.
And the second is our
ipage application, which is a business-to-business solution.
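As a rough illustration of step four, and not Ingram's actual code, here is a sketch of a Cloud Function that could react to new or changed book content landing in a Cloud Storage bucket, synthesize it with the Text-to-Speech API, and store the resulting wave file back in Cloud Storage. The bucket names, file layout, and voice choice are assumptions.

```python
from google.cloud import storage, texttospeech

# Hypothetical bucket for generated audio samples.
AUDIO_BUCKET = "example-audio-samples"

def synthesize_sample(event, context):
    """Background Cloud Function triggered when a text file is
    finalized in the ingestion bucket (google.storage.object.finalize)."""
    storage_client = storage.Client()
    tts_client = texttospeech.TextToSpeechClient()

    # Read the new or changed book sample text.
    source_bucket = storage_client.bucket(event["bucket"])
    text = source_bucket.blob(event["name"]).download_as_text()

    # Synthesize it with a WaveNet voice into LINEAR16 (wave) audio.
    response = tts_client.synthesize_speech(
        input=texttospeech.SynthesisInput(text=text),
        voice=texttospeech.VoiceSelectionParams(
            language_code="en-US", name="en-US-Wavenet-D"),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.LINEAR16),
    )

    # Store the wave file for the downstream preview and sales systems.
    out_name = event["name"].rsplit(".", 1)[0] + ".wav"
    storage_client.bucket(AUDIO_BUCKET).blob(out_name).upload_from_string(
        response.audio_content, content_type="audio/wav")
```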
The second part of it is
also where the secret sauce
happens, right?
So the first part is translating that text to speech and developing that wave file.
The second part of it is
when a sample is collected--
much like the sample
you just heard--
we bring it into Dialogflow Enterprise Edition.
And the system parses that data, the title being the key, and passes it through Cloud Dataflow into Cloud SQL.
Dialogflow then pulls that sample out of Cloud SQL and passes it through the Cloud Functions.
And then, if there's a transaction, we pull it back into Google Express for the conclusion of the transaction.
So we can bring it all
the way through the sample
and listening to that
sample into the conclusion
of a transaction.
And we think that that's
going to be really powerful.
So in conclusion-- I don't want
to get between you and dessert
or what have you--
we talked a little bit about
some of the business trends
and what's happening
right now in the market.
We talked a little bit about the changes from physical to e-books to audio.
We talked about opportunities
around accessibility
and closing the
loop between people
who have needs and the systems
and the technology today
and what we can do, what Google
has done with natural language.
And we talked a little
bit about the technology
and how we support it.
I think we have
about 23 seconds.
So I'll ask if there are any
questions either for any of us.
[APPLAUSE]
All right.
Thank you.
[MUSIC PLAYING]
