So, today we have
an intern talk by Han Zhao,
Han was an intern in our group.
He was working as part
of a bigger project
on Speech Enhancement, so this
project has been exciting and
we've had one more
talk before this.
So Han is from CMU, he is
advised by Professor Jeff Gordon,
and he is in the Machine Learning
Department and he works on
a variety of machine learning
and neural network problems,
as well as sum-product networks and
a lot of other things.
So, today he will talk
to us about how to use a Convolutional-Recurrent
Architecture to enhance speech
signals, and I will hand over
the floor to Han from here.
>> Thanks.
>> You're welcome.
>> Thank you.
So, today I will talk about
Speech Enhancement with Convolutional-Recurrent
networks.
So let's start with
the Motivation, so
why do we need Speech
Enhancement in the first place?
So, usually people care about
automatic speech recognition
systems, and such a system
typically has two phases.
The first one is
the training phase and
the following one
is the inference phase.
So what people do
in the training phase is,
we collect a large amount of
clean speech paired with
corresponding transcriptions,
and
the clean speech and transcriptions
are fed into the ASR system,
and the system then will be
trained on this data.
And after that, during
the inference phase,
the black-box ASR system
will be fixed, and
then it will be deployed
in a real product.
But the problem is when it
is applied in practice,
the speech collected from users
usually contains a lot of noise.
So this creates a problem of
distribution mismatch,
which means the training
data we used to train the
ASR system is different from the
test data we see in practice.
So how to relieve this problem
is the focus of this study,
called speech enhancement. So
more formally the problem
is as follows.
We're given noisy speech and
we want to find a filter,
a function approximator,
such that after applying this
function with the
noisy speech as input, we get clean speech
as the output.
So here is the outline
of the talk today, so
I will first give a very brief
introduction to the Background
like the outline of the speech
enhancement system.
And also the definition
of the noisy signal and
of speech enhancement. Next I
will briefly describe related
work using data-driven approaches
for speech enhancement.
And the third part is
the main part of this talk,
where I will introduce
Convolutional-Recurrent Network
for Speech Enhancement.
After that,
I will also briefly discuss our
preliminary results on combining
clean speech to further improve
speech enhancement using
semi-supervised learning
approaches.
And finally, I will conclude
the talk with some take-home
messages, which hopefully
will be useful for you.
So let's start with
the problem setup.
So here the small x t
represents the clean signal
we have in practice,
and the epsilon t is
the noise we observe.
Under this model, called the additive
noise model, the noisy signal
is the summation
of these two signals.
But usually,
to solve this problem people
make some assumptions on
the type of noise we observe.
The first typical assumption is
the stationarity of the noise,
which means the noise statistics
do not depend on time;
the noise is assumed to be
stationary at each time step.
The second one, which is
needed by classic approaches to
solve this problem, is that we
need to assume some specific
type of noise we observe
in the noisy signal.
For example, a very simple
assumption is that the noise we
observe at each time step
is white noise.
And under those typical
assumptions, people
have developed closed-form
solutions for speech enhancement,
like spectral subtraction,
the minimum mean squared error estimator,
or subspace approaches.
But a problem with classical
methods is that they are based on
very strict
statistical assumptions about the noise.
So on one hand it's very simple,
and
it's computationally very
efficient to implement
the speech enhancement system
under those assumptions.
And we can also have
optimality guarantees
under those assumptions.
And the third point is
it's highly interpretable,
because we know exactly where
the noise comes from, and
what type of noise we
are concerned with.
But the cons are
also very obvious.
The first is, it's limited
to only stationary noise,
which means that the noise
cannot change along time.
The second one is those
approaches can only deal with
noise with specific kinds of
characteristics, like white noise
or pink noise,
something like that.
But the question now
is what if we can collect
a very large set of data,
a large paired data set,
meaning that for each
noisy signal we also have
its corresponding clean signal.
So, what if we can have
a large amount of such data?
The answer is we can probably
use a data-driven approach
instead of making rigid
assumptions about the noise we
observe.
Let's just collect a large
amount of data and
fit models to learn
a function approximator h.
So, more formally, the problem
we are facing now is,
we are given a set of paired
signals x hat and x.
Here x hat means
the noisy signal and
x means its
corresponding clean signal.
And our goal is to build
a function approximator
h such that when
h is given x hat as input,
h outputs its corresponding
clean signal.
And usually in practice,
at least at Microsoft,
the scale of data we
are facing is like,
the number of frames in our data
set is in the tens of millions.
And the dimension of
each frame is around 256.
Yes.
Yes.
>> So where are you getting
these paired signals, are you
generating those?
>> So we have noise
collected inside Microsoft.
>> [INAUDIBLE]
>> So
talking about data-driven
approaches, a first and
obvious approach to solve this
problem, since we are fitting a
regression function, is: why not
use non-parametric regression,
also known as kernel regression,
to solve this problem?
So the problem with
kernel regression and
the related approaches is that
the sample complexity grows at an
exponential rate in
the dimension of our data.
Here the dimension
is like 256 and
the exponential of it is very
large, which means we need a
very large data set for this
kind of approach to converge.
And the second problem is the
computational cost of this approach
does not scale well with
the size of the data,
because there's an internal
matrix inversion problem. And
then the third point, which I
think is the most important
one, is that kernel regression does
not share information between tasks.
Here each task corresponds to
a regression problem for
one frequency bin
in our data,
but usually those frequencies
are highly correlated.
And the kernel regression
approach cannot
take advantage
of this characteristic
of our problem.
So, the next approach
is parametric.
For parametric regression,
we can use neural networks.
So, why might neural networks
be a good idea here?
The first one is they are
very flexible for
representation learning, which
means by choosing different
structures of the network we can
decide the function basis we're
going to use to approximate
the optimal function h.
The second one is
it's very scalable.
So it's linear in both
the size of the data
set as well as the dimension
of our input data.
And the third point is a neural network
provides us a very natural paradigm
for
multi-task learning.
In the sense that usually,
when we apply a neural network for
function approximation,
we have two steps.
The first step is to learn
a basis function which is
a common representation for
all the features.
And the second step is
a simple linear regression.
Here we have different
regression problems, but
all those regression problems
share the same representation.
And this creates a natural task
relatedness for this approach.
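As a rough sketch of this shared-representation view, assuming PyTorch and illustrative sizes (the 256 bins match the talk, the hidden width is made up), all 256 per-bin regressions can share one learned basis:

```python
import torch
import torch.nn as nn

# Illustrative sizes: 256-dim noisy frame in, one regression output per frequency bin.
N_BINS, HIDDEN = 256, 1024

# A shared nonlinear basis (the "common representation") ...
shared = nn.Sequential(nn.Linear(N_BINS, HIDDEN), nn.ReLU())
# ... followed by 256 linear regressors, packed into a single Linear layer:
# each output unit is the per-frequency-bin regression on the shared features.
heads = nn.Linear(HIDDEN, N_BINS)

noisy_frames = torch.randn(8, N_BINS)            # a batch of 8 noisy frames
clean_estimate = heads(shared(noisy_frames))     # all bins share the same basis
```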
Okay, before I
introduce our model,
I will first list some
related work in the literature
which also used a data-driven
approach with neural networks
to solve speech enhancement.
There are roughly two sets
of different approaches.
The first one is using recurrent
neural networks for denoising.
I think it's from Stanford,
Professor Andrew Ng's
group, in 2012.
And the other line of
methods uses a general
deep neural network, which is
basically a multilayer perceptron,
to do frame-based denoising.
Both of them use neural
networks; the first one uses
a sequential model while the second
one uses frame-based denoising.
Okay, so I will first introduce
the pipeline that we are going
to use for our approach, but
the talk is mainly focused
on the second part.
So every time we
observe a signal in the time domain,
little x t,
what we do is we first apply
a short-time Fourier transform
to transform the signal from the time
domain to the frequency domain.
Here the capital X is actually a
matrix where the first dimension
is time and the second
dimension is the frequency bin.
So each one-dimensional time
signal, small x t, is first
transformed into a matrix,
capital X.
And we apply this procedure
to both the noisy data and
the clean data.
So now our input training data
are matrices, one for each signal.
And the second part of the
pipeline is to build a neural
network to approximate
the denoising function h,
such that when we input X hat,
which is the noisy spectrogram,
it outputs the clean
spectrogram capital X.
And the third part of this
pipeline is just to apply
the inverse short-time Fourier
transform to convert
the denoised spectrogram
back into the time domain.
In this talk, I will mainly
focus on the second part, i.e.,
how we should design
our neural network to
better approximate
the optimal function h.
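A minimal sketch of that three-stage pipeline, assuming SciPy's STFT utilities and a placeholder `denoise` function standing in for the learned model (the phase handling here reflects what is said later in the talk, i.e. the noisy phase is reused):

```python
import numpy as np
from scipy.signal import stft, istft

def denoise(mag):
    """Placeholder for the learned function h: noisy magnitude -> clean magnitude."""
    return mag  # identity here; the real model is the convolutional-recurrent network

def enhance(x_noisy, fs=16000, n_fft=512):
    # 1) STFT: 1-D time signal -> complex (frequency, time) matrix
    _, _, X = stft(x_noisy, fs=fs, nperseg=n_fft)
    mag, phase = np.abs(X), np.angle(X)
    # 2) denoise the magnitude spectrogram with the learned model h
    mag_clean = denoise(mag)
    # 3) inverse STFT, reusing the noisy phase, back to the time domain
    _, x_clean = istft(mag_clean * np.exp(1j * phase), fs=fs, nperseg=n_fft)
    return x_clean
```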
Okay, so let me repeat
again the setup here.
So what we have is a set
of paired input data.
Here, the capital X hat is
the spectrogram for the noisy
signal, and the capital X is its
corresponding clean version.
Both X hat and
X are matrices
with two dimensions.
The first dimension is time, and
the length of the signal can be
different for different inputs.
But the second dimension is
the number of frequency bins
we use in the frequency domain.
So roughly each signal
has around 500 frames.
And here I just show two
examples of typical
spectrograms we observe.
The left one corresponds to
the noisy spectrogram and
the right one is the
corresponding clean spectrogram.
So, before I introduce our approach,
there are two observations
which are very intuitive and
which guide the design of
our model for this problem.
The first one is that existing
DNN-based approaches do not
fully exploit the structure
of the problem at hand.
So what I mean by this,
the first point is,
the frame-based DNN
regression does
not use the temporal
locality of the signal.
What I mean is, say for
example let's check the clean
spectrogram: it is clear that if
we check this yellow line,
it has continuity along
the time dimension.
But if we just use a
frame-based DNN regression
to solve this problem,
because it does not have
the context information,
it cannot fully exploit
the structure of the problem.
One solution is to
use a context window,
which means we do not just do
regression based on one frame,
we can pad it with context.
But this creates
another problem:
how to choose
the hyperparameter,
I mean, how long a context
window we should use.
And for different signals with
different kinds of noise,
should we change the length
of the context window, etc.
But this problem can be
potentially solved by using
recurrent models,
which have the power
of using an adaptive window size
depending on the input signal.
So the second observation is
the spatial locality.
What I mean is, the DNN is
a very generic model.
We know that with
enough capacity, which means with
enough hidden neurons, a DNN
can approximate any continuous
function we're interested in.
But this may require a very
large hidden layer. And
here we have a clear
observation:
the spectrogram itself
is also continuous in
the spatial domain,
which means, for
two frequency bins which
are close to each other,
the local patterns should
be similar to each other.
So this suggests
that in order to do
regression based on spectrograms,
we can probably treat the
spectrogram itself as an image.
Images have some
local structure;
that's the reason why, to solve
image-related problems, people use
convolutional neural networks.
Okay, so if we change
our point of view,
each spectrogram
is again an image.
What it means is we
can first apply a convolutional
neural network to the spectrogram,
and then also take advantage
of the temporal locality of
the spectrogram to apply
a recurrent neural network.
>> So by spatial you mean
in the time-frequency
space, right?
>> Yes.
>> Okay.
>> Yeah.
>> Because technically
in the audio,
there is no direction
in real space
when you say spatial
[INAUDIBLE].
>> What I mean is that
the frequencies are highly
correlated, yes.
>> Is there any relation,
any condition between the speech
signal and the noise, or
is it totally independent?
>> They're independent.
>> And is there any periodic
assumption of the noise or?
>> No, we don't make any
assumption about noise.
As I just explained,
those two observations help
us to design our model.
So the first problem
can be solved
by using a recurrent neural
network, which has the advantage
of using an adaptive window size
instead of using a fixed one.
The second problem
can be solved using
convolutional neural networks.
A convolutional neural
network is actually
a sparse version of a multilayer
[INAUDIBLE], in
that it has fewer parameters
to be trained, but
it can also take advantage of
the structure of the problem.
Okay, so before I introduce
more details I will first give
a high-level introduction to
the model we are going to use. So
as I introduced before,
it's an end-to-end approach.
The input is the noisy
spectrogram and
the output is the clean version.
And in the intermediate
there are two parts.
The first part is we apply
convolution kernels to
the noisy spectrogram.
And after
the convolution kernels,
we feed the intermediate feature
maps into recurrent neural
networks.
Here we use a bidirectional LSTM,
and I'll explain later why
I chose this specific structure.
And after we get the output
of the bidirectional LSTM,
we just apply a very simple
linear regression to regress
from the intermediate feature
map to the final output,
which is the clean spectrogram.
And the objective function
here is very simple;
we just use the MSE, which is the squared
error between the two matrices.
And this is differentiable.
We can use different kinds
of optimization algorithms
to optimize this
objective function to
find the best parameters for
this model.
So at a high level, why should we
expect this model to work well?
Because this model
helps us to exploit the
observation we just made,
which is the continuity of
the spectrogram in both the time and
spatial domains.
So convolution can
be understood through,
I will make this point
more clear later,
but basically the convolutional
neural network, through
its local linear filters,
helps to capture local
patterns in the spectrogram.
And the next is the bi-LSTM,
which actually works as
a flexible context window,
a symmetric context window, which
means it can take
advantage of the previous frames
and also the frames after.
Then the last point, it's
an end-to-end learning system.
And we don't make any assumption
about the noise we observe or
the signal we are going to denoise.
What we require is a large
amount of paired data,
and then we just feed
the data to the system,
let it run, and optimize it.
So here I will just give
a very brief overview
of what convolution actually
does here in our system.
So I know,
for the signal processing
community, the definition of convolution
is slightly different from
the definition I use here.
It's actually because the space
we are going to deal
with is symmetric.
So usually, when people
talk about convolution,
they will first
reverse the signal and
then do pointwise multiplication
and summation.
But here I just simplify it
to be the inner product,
which means given a 2D image,
let's say the large
rectangle here is just an image,
now I have a 2D kernel,
which is the small square here.
What I do is just align
the small square to each part of
the large rectangle, computing
the corresponding inner product.
Say here we have a small
convolution kernel with
size b and w.
b corresponds to the number of
frequency bins
the kernel covers.
And w corresponds
to the context window
this kernel covers.
So I align this
small kernel to a patch,
a local patch of the 2D matrix,
and compute the inner product.
And then I take a nonlinear
transformation.
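A small sketch of convolution in that simplified sense, a sliding inner product with no kernel flipping, followed by a nonlinearity; all sizes are illustrative:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image, taking an inner product at each position."""
    H, W = image.shape
    b, w = kernel.shape                          # b frequency bins by w time frames
    out = np.empty((H - b + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + b, j:j + w]
            out[i, j] = np.sum(patch * kernel)   # inner product = patch similarity
    return np.maximum(out, 0.0)                  # a nonlinear transformation (ReLU here)

spectrogram = np.random.rand(256, 500)           # 256 frequency bins x 500 frames (example)
kernel = np.random.randn(32, 11)                 # e.g. 32 bins x an 11-frame context window
feature_map = conv2d_valid(spectrogram, kernel)
```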
Any question so far?
No? And because this is slightly
different from the original
definition of convolution, the
original definition will first
transpose the convolution kernel
and then do the product.
But since we are optimizing
the kernels, and
the space we're optimizing over is
symmetric in the sense that,
if we can find a kernel
w1, that's the same thing as
finding a corresponding kernel w2,
which is the transpose of w1,
which gives the same result.
>> This is where you would do
convolution in the frequency
domain with the complex values
of the time-frequency spectrum.
Then you do complex conjugation
and multiplication
and summation.
>> Yes.
>> In your case,
you work with
the magnitudes only.
So it's a real number, and this
is completely valid definition
for convolution in this case.
>> I think in convolution
what people usually do is,
given two
one-dimensional signals,
the first thing to do is to
reverse the convolution kernel,
and shift that kernel along the
signal doing pointwise multiplication and summation.
The signal, itself,
can be complex or real, right?
>> That's correct, but
in the spectral domain
convolution becomes
multiplication.
>> Yeah, yeah.
>> And
then it matches the definition.
You just don't do our
back if you get there.
>> That's right, yes.
>> That's fine, it's not a huge
contradiction in [INAUDIBLE]
>> Yeah, yeah, yeah.
So here what I want
to give is a high-level
idea of what convolution does,
because it actually computes
the inner product.
And the inner product is
basically computing a similarity
between the local image patch
and the convolution kernel.
So we can effectively understand
what convolution is doing as
being like a local
feature detector.
It detects a specific kind
of pattern in the image;
here the image corresponds
to the spectrogram we have.
And if the local pattern
in the spectrogram matches
our convolution kernel, then
it has a large activation.
Yes, so
that's the high level idea.
And so what we do in our
network is we first zero-pad
the spectrogram in the time
domain to make sure
that after convolution
the feature map we get has
exactly the same length
as the original input.
The reason why we need to
guarantee that is because
we need to feed the intermediate
feature maps into
a following bidirectional
recurrent neural network,
and the output of the recurrent
neural network should have
the same length as the input.
>> So
you're not doing any pooling?
>> I don't do pooling.
So here, one convolutional
kernel will give us
a corresponding feature map.
But we can have many different
convolutional kernels.
And now, what we do is
we just concatenate
all the feature maps together.
Say each feature map is
of size t by f prime.
And if we have k different
feature maps we just concatenate
them together to get a large one.
The length along the time dimension
is still t, but the size of the
other dimension of the concatenated
feature map is k times the original size.
So after we obtain
the large feature map here,
we just feed it into
a bi-directional LSTM.
So this part looks
very scary at first,
but basically what it
does is very simple.
It's a two-direction
recurrent neural network,
and each cell here corresponds
to this complex diagram.
But we don't need to understand
every detail of this in
this talk.
What is required is to
understand that the LSTM cell is
a special kind of
recurrent neural net.
And its advantage over
a traditional recurrent
net is that it has a forget
gate and a memory cell.
And these two parts can help
the recurrent neural network
decide how large the effect
of the context window is.
That's the reason
why we use LSTM.
It's pretty standard in the machine
learning community now to use
this specific structure.
So after we obtain
the output of the
bi-LSTM, what we
do is very simple.
We just do a linear regression.
So here, W times o t, where
o t is the output at
time step t of the bi-LSTM,
and we do a linear regression
on it to get the final
clean spectrogram.
But here we have
a restriction on the domain
of the spectrogram.
The spectrogram is required
to be non-negative.
So we simply apply
a projection, which of course
coincides with
the ReLU function.
But here if we have different
requirements on the range of the
spectrogram, say it has
a lower bound of epsilon,
we just need to change
the 0 here to be epsilon.
It still corresponds
to the optimal projection.
And the final output is X t.
Once we obtain X t we just
match it with the clean
spectrogram,
compute the corresponding MSE, and
apply an optimization
algorithm to solve it.
The specific algorithm we are
going to use here is AdaDelta,
but people are free to use
other kinds of algorithms.
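A rough PyTorch sketch of the architecture as described so far: 2-D convolution over the spectrogram, a bidirectional LSTM over time, a per-frame linear regression, a clamp to keep the output non-negative, the MSE objective, and AdaDelta. The kernel size, stride, and layer widths are assumptions taken from numbers mentioned elsewhere in the talk, not the exact model:

```python
import torch
import torch.nn as nn

class ConvRecurrentEnhancer(nn.Module):
    def __init__(self, n_bins=256, n_kernels=256, hidden=1024):
        super().__init__()
        # Convolution over (frequency, time); pad only along time to keep its length.
        self.conv = nn.Conv2d(1, n_kernels, kernel_size=(32, 11),
                              stride=(16, 1), padding=(0, 5))
        # Bidirectional LSTM over the time axis of the concatenated feature maps.
        feat_dim = n_kernels * ((n_bins - 32) // 16 + 1)
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        # Per-frame linear regression from the LSTM output to the clean spectrogram.
        self.regress = nn.Linear(2 * hidden, n_bins)

    def forward(self, noisy_spec):                            # (batch, freq, time)
        h = torch.relu(self.conv(noisy_spec.unsqueeze(1)))    # (batch, K, F', time)
        h = h.flatten(1, 2).transpose(1, 2)                   # (batch, time, K * F')
        h, _ = self.blstm(h)
        est = self.regress(h).transpose(1, 2)                 # (batch, freq, time)
        return torch.clamp(est, min=0.0)                      # non-negative projection

model = ConvRecurrentEnhancer()
optimizer = torch.optim.Adadelta(model.parameters())
loss_fn = nn.MSELoss()   # squared error between estimated and clean spectrograms
```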
So I just introduced a system
we are going to use,
but in order to make it work,
in practice,
we need to apply several tricks.
So these tricks
are what I found to be
important in order to
make it work in practice.
And the first one is
orthogonal initialization of
the hidden-to-hidden transition
matrix of the recurrent neural
network.
The reason why we need to use
this is to make sure that at
the beginning of
the optimization,
the objective function
does not blow up.
The reason why we use
an orthogonal initialization
is just to make sure all
the eigenvalues have magnitude 1,
so that even if the input
sequence is very long,
it does not blow up.
The second one is learning
rate scheduling:
after 50 iterations,
we multiply the learning rate by 0.1,
and again every 50 iterations.
And in our specific example,
we find that using a
small batch size helps.
The extreme case here is
we used batch size one.
And also, it's quite important
to choose the kernel size and
its stride.
Because here,
the kernel size corresponds to
the context window of
the convolution and
also the number of
frequency bins it covers.
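A sketch of how those tricks might look in code, continuing the earlier model sketch (the `utterances` iterable and the number of epochs are placeholders):

```python
import torch
import torch.nn as nn

# 1) Orthogonal initialization of the hidden-to-hidden (recurrent) matrices,
#    so the recurrence does not blow up on long sequences early in training.
for name, param in model.blstm.named_parameters():
    if "weight_hh" in name:
        nn.init.orthogonal_(param)

# 2) AdaDelta, with the learning rate multiplied by 0.1 every 50 epochs.
optimizer = torch.optim.Adadelta(model.parameters(), lr=1.0)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)

# 3) Small batches, batch size one in the extreme case: one utterance per step.
num_epochs = 200                                   # illustrative
for epoch in range(num_epochs):
    for noisy_spec, clean_spec in utterances:      # assumed iterable of paired spectrograms
        optimizer.zero_grad()
        loss = loss_fn(model(noisy_spec.unsqueeze(0)), clean_spec.unsqueeze(0))
        loss.backward()
        optimizer.step()
    scheduler.step()
```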
Now, yep.
>> Before we consider
the data set,
how do you do the training?
So the convolution
part is clear.
What else do you do [INAUDIBLE]?
The kernels, you just do
the convolution and feed the-
>> Yes.
>> And then, after this,
you have a regression.
>> Yes.
>> How do you do that then?
First the BLSTM then
the regression, or
do you do these separately?
>> I do this [INAUDIBLE], yeah.
>> Thanks.
Because otherwise
if it has a limited
[INAUDIBLE] solution, no?
Irrigation clock?
>> Yes.
The [INAUDIBLE] Yes. Okay.
So I'm
going to introduce
the experiments.
So the specific data set we
use here is a single-channel,
Microsoft-internal data set.
There are [INAUDIBLE]
utterances using Cortana, and
the utterances contain
male utterances,
female utterances, and
children's utterances.
And the sampling rate
we're using is 16k, and
each utterance is around 5-9
seconds. For this dataset we
manually injected noise
into the clean speech we have.
We have 25 different
types of noise,
with 377 different files.
So the training data contains
roughly 7,500 utterances.
And we use the validation
data set to do model selection,
which means after each
iteration of training,
we check the performance
on a validation set.
Whenever the validation error
decreases, we save the model, so
at the end of training we
have the best model on the
validation set.
And we use the best model
to run on the test set.
>> So it says 25 types of noise,
is that right?
>> Yeah, yeah.
>> How do you apply that?
Do you change the debris past or
for the duration?
[INAUDIBLE] 25 different
types of noise It's fixed.
we first inject the noise
into the clean speech,
and then after that
the data is fixed.
>> So do you use type one?
Do you mix all those
types together?
I don't understand.
>> Each utterance of
those 7,500 is mixed with
a given type of noise. Okay?
>> Sample from the-
>> Yeah, yeah, yeah.
Is one more component here.
So the [INAUDIBLE] features
are slow [INAUDIBLE]
with the [INAUDIBLE]
simulator operation
because of that
[INAUDIBLE] here.
Okay, so before we compare
our model with the state-of-the-art
models in
the literature, we first need to
decide which specific model
variant we are going to use.
The different hyperparameters
are here,
like what's the size of
the convolution kernel
we are going to use, and
how many hidden layers in the LSTM,
and how many neurons
in each layer of the LSTM, etc.
And we also try to inject
some prior knowledge,
formalized using a regularizer,
to see whether those kinds of
regularizers will help us or not.
So the first factor we're
going to investigate is
whether there is a big
difference between a single-directional
LSTM and a
bidirectional LSTM.
Why this factor is important
in practice is, if we decide
to use a bidirectional LSTM, the
system is going to be offline,
which means we need to first
collect the whole utterance,
and then after we have
obtained the utterance,
we can apply our model.
But if we have a single-directional
LSTM,
the system can be deployed
in an online fashion,
which means once we
see a new frame we can
immediately output
the corresponding clean frame.
And here we use nine
different metrics to
measure the performance
of different models.
So the signal-to-noise ratio,
SNR, is quite typical.
And also we use
the Log-spectral Distance (LSD).
And we also measure
the word error rate and
the sentence error rate;
those measures are based on
the transcription of
the corresponding signal.
But we don't optimize
the following ASR system.
The following ASR system
is treated as fixed,
but what we do is we
enhance our speech, and
see whether the enhancement
will lead to an improvement in
word error rate or
sentence error rate or not.
The final metric
we're going to use
is the perceptual evaluation
of the quality of the speech.
It's called PESQ.
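For concreteness, a small sketch of two of these signal-level metrics, SNR and log-spectral distance, under common textbook definitions (the exact formulas used in this project may differ; WER, SER, and PESQ come from the fixed ASR system and the PESQ tool, so they are not shown):

```python
import numpy as np

def snr_db(clean, estimate):
    """Signal-to-noise ratio in dB between a clean waveform and an estimate."""
    noise = clean - estimate
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))

def lsd(clean_spec, est_spec, eps=1e-8):
    """Log-spectral distance between two magnitude spectrograms (freq x time)."""
    diff = 20.0 * (np.log10(clean_spec + eps) - np.log10(est_spec + eps))
    return np.mean(np.sqrt(np.mean(diff ** 2, axis=0)))  # RMS over freq, mean over time
```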
So the first
experiment we did is
a comparison of single
versus bidirectional LSTM.
So what's the influence of
a symmetric context window
versus a single-directional,
causal context window.
In these two experiments,
all the other model structures
are treated as the same.
The only difference is, for
the first model,
we only use a single-direction
recurrent net, and
the second model
uses bi-direction.
And here the red N corresponds
to the original noisy signal.
And we don't do any
enhancement to it.
The green line here,
C, corresponds to
the clean signal we have.
And the next lines
correspond to
the final results after
optimization we obtain, both for
the single-directional and
the bi-directional LSTM.
>> So does it mean that the
single direction applies well to
data there?
>> So the single direction LSTM
has only one hidden layer, and
the size of the hidden
vector is 500 [INAUDIBLE].
>> [INAUDIBLE]
>> Yes.
>> Are they consistent
comparisons?
>> The only,
yeah they are the same.
The only difference is it's
Bi-directional while the first
one is single directional.
>> [INAUDIBLE] 11 frames do
they have a frequency base?
>> Yes.
So the 11 corresponds to
the width in the time domain, which
means the context window, the
effective context window, is 11,
and then the frequency bins it covers,
because the total
number of bins is 256.
And also the stride,
which means how we
move the convolution
kernel across this image.
We don't jump along time, but-
>> You move 16 frequency bins.
>> Yeah, I move 16 frequency
bins along the frequency axis.
I will show a comparison
of this impact later.
But the reason why is
because the frequency bins
are highly correlated
with each other.
So mathematically we can also
choose this number to be 1, but
the result is the intermediate
feature map will be very large,
and it will have a lot of
redundant information in the
intermediate maps.
Due to computational
considerations,
we choose a large jump here.
So from this table, it's pretty
clear that the bi-directional LSTM
dominates the single-directional
one.
As expected, because when we use
bi-directional, effectively we
have more information.
>> Do you mind if I
ask a quick question?
>> Yes.
>> I don't understand
the numbers.
For example if you look
at the MSE numbers,
is that mean squared error?
>> That's mean squared
error in the time domain.
>> And
SNR is also time domain SNR?
>> Yes.
>> [INAUDIBLE]
>> [INAUDIBLE]
>> Then something is a bit off,
because from the [INAUDIBLE].
Let's say the first row,
noisy, is 0.043.
And then the single
pass is 0.034.
That is roughly
a 1 dB improvement.
But it shows going
from 15 to 41.
>> Yeah, but
they're not that related.
>> No,
the MSE is in the time domain.
>> That's why I asked
if the SNR is in
the time domain as well.
>> Or measure it in frequency
domain, but it doesn't matter.
>> There's a rating
function too?
>> Am I missing
something on the board?
>> Sorry.
>> It was just added
to the set of data.
>> Okay, sorry.
>> So this is basically one of
the four intros in the seed
project.
And we made sure that we
have the same data set and
the same measurement set for
all projects.
>> Yeah, cuz I see the name
goes the s versus b comparison
of course.
I was just trying to
understand the raw data.
>> Yeah.
So the MSE I show here
is measured on the final
waveform.
But the MSE I use-
>> Sorry for
the interruption, is that like
on the total waveform or
is that the MSE by frame?
>> By frame.
>> By frame.
[CROSSTALK]
>> We
used to call that segmental.
>> Segmental.
>> You have segmental
[INAUDIBLE] segmental, right?
>> Okay.
Okay, I see.
But I understand you are doing
the optimization in the frequency
domain, so
it's pretty different.
>> That'll explain it.
Yeah.
>> Okay.
>> [INAUDIBLE] conditions
[INAUDIBLE] estimate?
>> And also, if you don't
mind another quick question,
since you do this spectral
processing, you
go through the frames and
look at the magnitude.
When it comes time to synthesize
back the sound, you need a phase.
Do you use the original phase?
>> I use the original
phase from the noisy file.
>> Okay, which means there is
noise in the phase-
>> Yes.
>> That wasn't processed.
>> That's right.
>> Right, okay.
>> Yes.
>> Just to double check.
It's common to the pattern,
I just wanna-
>> Yeah, yeah, yeah, yeah.
>> Okay.
>> Yeah, we used it to
>> Okay, so
from this experiment you
can roughly see that
the bi-directional outperforms
the single-directional.
And the next experiment
shows the influence of
the size of the convolution kernels
on the final performance.
We did three different
experiments.
The first two use
the stride (1,1), which means
it's very
redundant in the frequency bins.
So, the difference
between these two is,
the window size of
the first one is 5,
but the window size of
the second one is 1,
which means it's basically
a 1D convolution.
So the window size is 1,
but the number of frequency
bins covered is all the frequencies.
So it's called
a 1D convolution,
which is quite typical.
And the first
one is a 2D convolution.
The difference between
the third experiment and
the first one is that I changed
the window size,
and I also made the stride
along frequency larger.
So, the difference from
the first one is that
it has fewer intermediate
feature maps.
The intermediate feature maps for
the third one are smaller
than for the first one.
Presumably it should perform
worse than the first one, but
the larger context window
compensates for that.
If we now compare
the results we obtain here,
except for the LSD measure, for
all the other measures
the third one
outperforms the first one,
which means that the impact
of the context window here,
11, is much more important
than the stride.
I mean,
it's also quite clear here:
I skip a lot in
the frequency domain,
but because the frequency
bins are highly correlated,
it doesn't lose too
much information.
Even compared with the (1,
1) stride, it ends up with comparable
performance.
>> We have, you have 2.66
in the previous slide.
>> Which one?
>> What is the difference,
2.69 here?
>> Yep.
>> What is the difference?
>> The size of the-
>> The number of kernels,
the number of kernels.
>> And also
the size of the hidden layers.
>> I see,
so what we're trying
to do is whenever we try to
vary one parameter, keep all the others.
>> Yeah, keep all
the other parameters fixed.
It's a controlled experiment so
that we can be sure the only
difference is because of
the factor that we change. Also,
this experiment shows
the impact
of the number of feature maps,
which corresponds to the number
of convolution kernels,
on the final performance.
So again, it's improved,
but very slightly.
Here I doubled the number of
kernels used in the experiment.
But if we check the PESQ and
also the word error rate,
the reduction is minimal.
So in the following experiments
our best model uses 256
convolution kernels.
>> [INAUDIBLE]
where the.
>> Yes.
>> Speech.
>> So each kernel corresponds to
a certain filter.
That just means I
have 256 different filters.
>> Yeah.
>> Can you blow them up?
How would they look?
>> No,
they're not interpretable, why?
>> Typically when you use more
traditional neural networks in
image processing, you can
clearly see the features there.
>> That's not the feature of
the convolution kernel itself.
What people do in that case
is they find the input image,
what it means is they compute
the gradient with respect to
the input image
until they find the largest
activation for that specific kernel,
do you see what I mean?
>> Yeah.
>> If they use
the input such as to make
the activation largest,
then by that way they interpret
the kernel as the detector for that patch.
It's not.
>> So in this particular case,
this is more a frequency pattern.
>> Yeah.
>> Than a specific image piece.
>> That's right, that's right.
But we could also
try to do that,
then try to find
the largest-activation input.
>> That would be
interesting to see,
which of those 256 capture
key components of the speech.
>> So you [INAUDIBLE] all the
hyperparameter tuning on,
really, the training data,
right?
>> Yep.
>> Like brought in the new real
world speech-
>> No.
>> But different noise
>> No,
I used the same training
validation test set.
>> And just to be clear,
the data you're using is
still real-world data?
The Cortana data was from real
speakers and the noise recording
was also real, there's
nothing synthetic here?
>> I'm just curious
how it behaves on
>> Data other than that noise,
and that speech.
>> Yeah.
As there's also that, I think.
>> And again, this experiment,
the difference
between this experiment and the
previous one is, I want to see,
if I further
increase the stride of
the kernel along the frequency
axis, whether there is
a serious degradation. And
if we compare these two,
on the other hand I compensate
with the size of the hidden layers,
and it's roughly the same.
This also helps me to
decide the optimal model
that at the end I'm going to
use to compare with the state-of-the-art
results.
This one investigates what's
the optimal choice for this
particular data set and how many
hidden layers we should use.
And the experiments here
correspond to using one,
two, or three
hidden layers in the BLSTM.
Again, all the other
hyperparameters are treated
as fixed.
So the difference can only be
explained by the depth of the BLSTM.
So, originally I would expect
the third one to perform the best.
Unfortunately it doesn't.
I attribute the reason to the
difficulty of optimization for
deeper recurrent nets.
For this experiment
the first and
the second one
are comparable to each other.
The second one is slightly
better in terms of the test metrics,
and that's the reason why
I picked two layers at the end.
>> Let me ask another one about PESQ.
It usually correlates reasonably
well with perception, and a 0.5
difference, as you're getting
there compared to the noisy,
should be noticeable.
So when you listen to
the reconstructed [INAUDIBLE] do
they sound-
>> Yeah.
I will show a case study.
I have a case study for
[INAUDIBLE] as well as ours.
>> Alright.
Cool.
>> And so another factor we
want to see is whether
the current model overfits
our data.
To see this, I used dropout,
which is a quite standard
technique to avoid overfitting
in the deep learning literature.
So I used four different
dropout rates to see whether
the performance changes a lot or
not.
If the performance changes a lot,
that means our current
model may overfit the data.
But from the experiments'
results, again,
all the other factors,
hyperparameters, are the same,
I didn't see a clear winning
model for those four different
experiments, which
means that for
this data set our current model
does not overfit the data.
And that's the reason why at
the end I don't use dropout for
the optimal architecture.
The final experiment I want to
describe here is the effect of
using our prior knowledge by adding
it into our objective function.
For all the previous
experiments, I only used the MSE
as our objective function.
One thing I noticed during
my experiments is that
the signal should be
continuous in time.
That's the observation I made.
So I manually add
a temporal difference regularizer
into the objective
function to enforce that every
two consecutive frames should
be close to each other.
I add this regularizer into the
original objective function and
do the experiments again.
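A sketch of what that temporal-difference regularizer could look like on top of the MSE objective; the weight lambda and the exact form are assumptions for illustration:

```python
import torch

def loss_with_temporal_smoothness(est_spec, clean_spec, lam=0.1):
    """MSE plus a penalty on differences between consecutive frames of the estimate."""
    mse = torch.mean((est_spec - clean_spec) ** 2)
    # est_spec is (batch, freq, time); penalize frame-to-frame changes along time.
    temporal_diff = est_spec[:, :, 1:] - est_spec[:, :, :-1]
    return mse + lam * torch.mean(temporal_diff ** 2)
```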
Those two experiments
use exactly the same model.
The only difference is the first
one uses only the original
MSE error, while the second one
adds this temporal regularizer.
And then, when I see the final
result, it seems that adding
this temporal regularizer
decreases the performance.
>> And
we try to smooth the spectrum.
This doesn't work [LAUGH].
>> Right, I tried to smooth
the clean spectrogram,
but it doesn't help.
>> [INAUDIBLE]
>> [LAUGH] He's been through it
many times.
More [CROSSTALK]
in the f way but.
>> Why?
Because I see the image of
clean spectrum brand is
continuous in time.
>> It looks continuous in time,
and maybe it is, but we
have a frame step of, let's
say, tens of milliseconds.
We jump every 16 milliseconds
and this already creates
enough difference. Yes, it is
continuous in some traces, but
not only across time;
they go across
time and frequency.
When we have formant frequencies,
they may move diagonally.
>> Yes,
they may move diagonally.
That's why I say it's in the figure.
>> Yes, and this smoothing
sometimes hurts the pattern
of the diagonal motion and it
creates artifacts that
are pretty audible,
and
PESQ will capture
some of that as well.
>> Yeah, I do see a degradation
in most of them actually.
>> It's not too surprising.
I don't have the prior
knowledge of this.
>> Okay.
>> Yeah.
Okay, the last one is just to
decide the hyperparameter for
the number of neurons
in each layer.
And also by doing these
experiments I find the best model
I'm going to use to compare
with the other state-of-the-art
results.
So of the four experiments here,
the fourth corresponds to the best
model I found,
which will be used later.
It has two hidden layers,
and each hidden layer has
1,024 neurons.
I used 256 convolution kernels,
each with an 11-frame context
window and 32 frequency bins.
And the stride is again the same as before.
And the fourth line here
achieves the optimal result
on eight
out of nine metrics.
And actually, if we compare the
word error rate, it's the first
time that we see any improvement
in word error rate without
doing anything to the ASR
system we are going to use.
For all the other experiments
which I performed so far,
after speech enhancement,
the word error rate actually gets hurt.
This one sees a slight
improvement,
around 1% improvement.
And the PESQ is like a
0.6 improvement.
Okay, so
all the previous experiments I
performed to choose
the best model,
and to see what's the impact
of different hyperparameters of
the model architecture on
the final performance.
Now, since I already have
the optimal one, I would like to
compare it against the state-of-the-art
results in
the literature.
So the first one I'm going to
compare with is the classic noise
suppressor currently used
in Microsoft product teams.
And the second one is the DNN-based
regression for denoising.
Here symmetric means I used
a context window of 11, and
the context window
is symmetric,
which means I have 5 frames
preceding the current frame and
5 frames following
the current frame.
The third one is
the symmetric log.
I use log because
in this paper,
they describe that using the
log spectrum instead of the
spectrum further improves
the performance.
So as a fair comparison, I used
exactly the same setting here.
And the next one is the
causal approach,
not using a symmetric
context window in this
experiment.
So the context window is causal,
which means I only use the
six frames preceding
the current one.
>> So, there's six previous frames?
>> Yes, so,
seven frames altogether.
For the last one I
used a Deep-RNN.
Again, I used exactly
the same structure,
the same optimal structure
described in this paper.
It has three hidden layers,
and each layer has
500 hidden neurons.
And it's a single direction RNN.
>> At this point, you started
to use a different data set or
still the same?
>> Still the same.
So here's the result compared
with the previous state-of-the-art
approaches
I just described.
So the first one is the current
tool we use in the product team.
And the DNS corresponds
to the symmetric one,
DNC corresponds to causal.
SL is the symmetric log [INAUDIBLE].
RNN is the one from Stanford.
So we see that for all
the metrics we achieve the best.
>> Can I double check
something with you?
Is that the Or
the array on [INAUDIBLE]?
>> Yes.
So we use the to get that.
We're just now past
the reference but
it's still is something.
These numbers are strange right?
The at hand, such a bad.
>> This is what
I achieved and also-
>> Just ask
>> Just weird.
>> [LAUGH]
>> Unfortunately yes.
>> None of these papers talked
about they only talked about.
It is the first time we
probably evaluated [INAUDIBLE].
>> So it is very common-
>> It's just weird,
from a Cortana perspective, WER
is incredibly important, right?
>> It is very common in the
speech community to state that
we don't need
speech enhancement.
Because we do
>> Activations,
then every time you try
to you actually hurt.
>> Okay, this is what,
in general,
that you look at those 55,
54, 64, 44.
Those used to be lower, back
in the times of based speech.
>> Apparently the VNL basis
speech where is the, they
handle the work area best, but
outside of that they just close.
And the other thing one
more this EMS is the speech
recognizer which was
optimized from a day
Full speech recognition.
You see, we don't get much.
We don't get much [INAUDIBLE].
>> Okay, the double.
>> Yes, that was
the optimization [INAUDIBLE].
>> Yeah, I will show some
case studies to see why these
have those problems.
>> Yet, DMS doesn't actually
improve the scenario that much,
but on the other hand, it holds
>> Yeah.
So once we see the spectrogram
we were on this network.
So here I show the spectrogram
of both noisy, clean,
and the corresponding
MS-Cortana tool.
So you see that, by checking
that spectogram the does not.
>> The speech and enhancement
vary progressively, actively.
But, on the other hand,
it doesn't hurt WER too much.
Probably just means that
three I mean, is the best.
But, let's listen
to the audio first.
>> First I will play
the noisy audio.
>> Who are the Seahawks
playing this Sunday?
>> And also the clean one.
>> Who are the Seahawks
playing this Sunday?
>> And then listen to the MS one.
>> Who
are the Seahawks
playing this Sunday?
>> [CROSSTALK]
>> It removed the stationary
noise, actually making
the background noises-
>> Way more noticeable.
>> That's funny, yeah.
>> This is what you can expect,
this is very classic
>> To
do more because there was no
musical noise on Cortana,
it was a little bit of
>> This during process.
>> The optimization makes
the noise suppressor more aggressive,
trying to minimize
the word error rate.
>> Okay, cool.
>> So if we check the spectrogram
of the DNN one, we see
that it enhances
very aggressively,
which means-
>> Okay.
>> Yeah, you see,
it loses all the high frequencies.
>> Which variant is this?
>> The DNN with the symmetric
11 context window.
Now let's listen to the-
>> [INAUDIBLE]
>> So just this side of the
>> 2000 hertz it will do better.
>> Yeah.
>> How about other things,
a little good
>> [LAUGH]
>> But you see the structure
that may have, you do end up with
a little bit of an artificial
structure, right?
In the DNN reconstruction?
>> There was no regularization.
>> There's no regularization.
>> But what I'm saying is that
in the same way that a little
bit of artificial
structure sounds funny,
if you try to regularize it-
>> I see.
>> You may actually contribute
to artificial structure if
it does sound funny.
>> I see, I see.
And the third one is the RNN,
from Stanford.
>> Who are the Seahawks
playing this Sunday?
>> It's overly aggressive.
>> Yeah.
>> We see how much of
the actual speech it cut.
>> Yeah, and the last one.
>> It did a nice job of
preserving the S's, right?
The S's sounded-
>> See above that,
there's the-
>> And then,
you see above how
they are there again.
>> The last one is ours.
>> Who are the Seahawks
playing this Sunday?
>> Wow.
>> That, yeah.
>> That's pretty.
>> Yeah, but it also contain
the background noise from baby.
But if you check the RNN one.
>> Who are the Seahawks
playing this Sunday?
>> We almost do not
hear the baby noise.
>> Yeah.
>> But on the other hand,
the distortion of the actual
speech is significant, right?
>> Yeah.
>> Ours has
much less distortion.
>> Yep, so.
>> Who are the Seahawks
playing this Sunday?
>> What was on the first slide?
>> The first, right?
>> Yes?
>> The noisy.
>> It's the noisy one.
>> Okay.
>> Do you want to listen-
>> The second was?
>> The clean one,
the crisp one, clean one.
>> This was the classic.
>> Yes,
can you show the second slide?
>> Second-
>> Second one?
>> Yes.
>> The DNN?
>> The DNN was good cut.
The third one was with
the distortions, okay.
>> Yep.
You know what?
One thing that comes to mind,
I don't know if that
makes any sense.
If you don't mind my making a
comment, is that since you took
out a lot of the more raw noise,
but the baby's voice
was still left together with
the girl making the speech,
it seems that at the end, you
end up with almost two sources.
There's the person speaking and
another person like
the baby making noise.
>> Yeah.
>> Which means if you were to,
just for
fun, try to feed that to a blind
deconvolution or a blind
source separation that tries to
separate two sources from that.
But because you did all the
clean-up of all of the junk that
tends to mess up
blind source separation,
I wonder if that, as an input
to a blind source separation,
could actually remove some
of the baby as well.
>> In this case.
>> But I'm not sure,
it's just to share with you,
it kinda came to mind, yeah.
>> Yeah, sorry.
>> Could you try feeding this
through again just to see what
would happen?
Just curious.
>> This one?
>> Yeah, get that signal
through a second time, see if?
>> Did that not do anything?
>> I'm not sure.
>> What's the question?
>> Just check the signal.
>> Who are the Seahawks
playing this Sunday?
>> And take that,
feed it through your
whole network again.
>> Feed it to [INAUDIBLE]
formation to see.
>> This also has been
tried many times.
>> I don't know.
>> You don't suppress from chamber-
>> [INAUDIBLE] You feed it back
to the network.
>> Guarantee her that result.
>> I don't know.
>> [LAUGH]
>> Actually I don't know.
>> Yeah.
>> And then, and actually, all
things that are kernel-based,
that tends to give you back
the same thing, right?
So if the second pass
already doesn't do much, or hurts,
>> Or hurts.
>> [LAUGH] Yes, as expected.
>> I guess-
>> A general further extension of
her idea would be to have
the sound source separation in
the neural network itself,
asking it to produce two outputs and
to maximize the statistical
independence of the [INAUDIBLE].
>> I see, instead of-
>> And just to
>> Yeah, yeah, and
that meant might make sense.
>> During the training.
>> I see, yeah, but
that is kind of tough,
because you're making a hard
assumption that there are two sources,
which we could only say
after listening.
>> And we're using all of our
brain power to decide that
it'll be two sources.
But in the real world,
how many sources would be
the right number, right?
And now, you have yet another
hyperparameter that is costly.
>> Well, in general,
I'd be against being too aggressive,
because think about this.
The speech recognizer
recognizes the words; the baby's
cry is not speech,
because it won't fit the
phonemes, and it will ignore it.
If you make it more aggressive,
you'll cut the baby crying, but
you'll also cut the important
information that's important
to the speech.
>> The stuff that is important,
yeah, that is [INAUDIBLE].
>> I'm completely fine
with less aggressive.
Less aggressive noise
suppression brings
better speech
recognition results.
>> Makes sense.
>> So, we finished the first
part of the talk, which is
the model design:
find the best model for
speech enhancement.
The second topic I'm going
to discuss is
whether we can further boost
the performance of speech
enhancement by using only
clean speech from TIMIT data.
Here I call it semi-supervised
learning with clean signals.
So the setting here is, besides
the supervised paired dataset
we have, additionally
we have unlabeled data
which corresponds to
clean speech from TIMIT.
And we want to find a way to
incorporate this part of clean
data into the existing system
to further improve the result.
So a very
straightforward thing to try in
the first place is to use the
clean speech to do pre-training
of the model.
And then after pre-training,
we just apply
the supervised learning
we just described before.
So here I, again,
do two experiments.
And the only difference between
these two experiments is,
in the baseline,
we don't do any pre-training,
but for the second experiment,
I used TIMIT data
to do pre-training,
in the sense that the input of
the network is the clean speech,
and the output should
also be the clean speech.
And by doing this,
effectively we are using our
neural network, instead of doing
speech enhancement, as a
recurrent autoencoder which
tries to preserve the clean
speech.
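A sketch of that two-phase procedure, reusing the model, optimizer, and loss from the earlier sketches; the two data iterables are placeholders:

```python
# Phase 1: pre-train as a recurrent autoencoder on clean TIMIT spectrograms.
for clean_spec in timit_clean_spectrograms:          # assumed iterable of spectrograms
    optimizer.zero_grad()
    target = clean_spec.unsqueeze(0)                 # input = output = clean speech
    loss = loss_fn(model(target), target)
    loss.backward()
    optimizer.step()

# Phase 2: the usual supervised training on the paired noisy/clean Cortana data.
for noisy_spec, clean_spec in cortana_pairs:         # assumed iterable of pairs
    optimizer.zero_grad()
    loss = loss_fn(model(noisy_spec.unsqueeze(0)), clean_spec.unsqueeze(0))
    loss.backward()
    optimizer.step()
```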
Okay, but surprisingly-
>> So hold on, hold on.
You train TIMIT, TIMIT in clean.
>> Yes.
>> And then without any
further training, you go and
do the processing of
>> No, no, no.
>> Or do you train-
>> After I do
the autoencoder
pre-training with TIMIT,
I again use the Cortana data.
>> It just means
a better starting point-
>> Yes.
>> For your optimization.
>> Yes, so
I should make it clear.
If the model is convex,
then it doesn't matter;
clearly pre-training
does not help,
it will only help with
faster convergence.
That's the best it can do.
But our scenario
now is non-convex.
It's not clear whether
pre-training will bring it into
a better local optimum or not.
So I did these experiments,
but unfortunately it hurts
the performance a lot to use
TIMIT data to do pre-training
in the first place.
You see it's almost a
0.2 degradation.
The models themselves
are exactly the same.
So, from
these experiments, we conclude
that pre-training fails in this
case to use the unlabeled data.
For the second one, what we do
is we use a hierarchical
denoising autoencoder.
What does that mean?
Let me explain to you.
So let's first ignore
the second part.
Traditional frame-based
denoising just inputs the noisy
signal,
outputs the predicted signal, and
does a regression
against the clean one.
But besides doing so,
we can make
it hierarchical denoising
in the following sense.
In each training phase,
I input both the noisy signal
and the clean signal, but
they go through exactly
the same structure.
After each non-linear
transformation,
I get intermediate
representations of both signals,
both the clean and the noisy.
And I add
the corresponding MSE loss
into the final
objective function.
>> So you tell the network at
each stage with the noise signal
to be as close to
the signal as possible?
>> Exactly.
So that's the hierarchical
denoising autoencoder version.
I also drew two parts here;
they show exactly
the same network.
And by doing this,
we can also utilize the TIMIT
clean data in the following way.
So the clean signal here is
again the clean TIMIT data.
But for the noisy one, at each layer
I manually inject
white Gaussian noise.
>> [INAUDIBLE]
>> So that's the way I try to
also use TIMIT data
in this structure.
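A sketch of the hierarchical denoising idea: clean and noisy inputs share the same layers, Gaussian noise is injected into the noisy branch at each layer (the TIMIT usage; for real noisy frames the injected noise can be zero), and an MSE term between the two intermediate representations is added at every level. Layer sizes and the noise scale are illustrative:

```python
import torch
import torch.nn as nn

class HierarchicalDenoiser(nn.Module):
    def __init__(self, n_bins=256, hidden=1024, n_layers=3):
        super().__init__()
        dims = [n_bins] + [hidden] * n_layers
        self.layers = nn.ModuleList([nn.Linear(dims[i], dims[i + 1])
                                     for i in range(n_layers)])
        self.out = nn.Linear(hidden, n_bins)

    def forward(self, noisy, clean, noise_std=0.1):
        h_noisy, h_clean, level_losses = noisy, clean, []
        for layer in self.layers:
            # Both branches go through exactly the same weights.
            h_noisy = torch.relu(layer(h_noisy + noise_std * torch.randn_like(h_noisy)))
            h_clean = torch.relu(layer(h_clean))
            # Intermediate MSE: pull the noisy representation toward the clean one.
            level_losses.append(torch.mean((h_noisy - h_clean) ** 2))
        recon = self.out(h_noisy)
        return torch.mean((recon - clean) ** 2) + sum(level_losses)
```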
And here are the results
we obtained.
You see there are two differences
between these two experiments.
The first one is, the
H experiment here
tries to use TIMIT with
injected white Gaussian noise,
and also uses
the intermediate MSE loss,
which enforces the model to
be a hierarchical denoising
architecture.
So these are the two
major differences, but
the model itself,
the number of neurons,
the number of layers,
are fixed to be the same.
Again, we see a serious
degradation in performance.
So for
all the experiments I did so
far, it seems every time I
want to incorporate my prior
knowledge about this task,
it hurts the final performance.
The best way to achieve
the best performance is simply
end-to-end learning without
any additional assumption or
additional prior about
the underlying task.
It seems every time I
want to inject a regularizer,
it almost always hurts
the performance a lot.
The third try we made is
to use mixture models
to help use
the unlabeled data.
So here,
at a high level, what we do is we
try to use the unlabeled data to
create different clusters
of the input frames.
And then we partition our noisy
data into different clusters,
and we fit different
models for the different clusters.
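A sketch of that mixture idea using k-means for the clustering and one enhancement model per cluster; the cluster source, k, and the `make_enhancer`/`train` helpers are all hypothetical placeholders:

```python
import numpy as np
from sklearn.cluster import KMeans

K = 8  # illustrative number of clusters

# 1) Cluster frames from the chosen source (noisy Cortana, clean Cortana, or clean TIMIT).
kmeans = KMeans(n_clusters=K).fit(cluster_source_frames)    # (n_frames, n_bins) array

# 2) Partition the paired training frames by cluster and fit one model per cluster.
assignments = kmeans.predict(noisy_frames)
models = [make_enhancer() for _ in range(K)]                 # hypothetical constructor
for k in range(K):
    idx = assignments == k
    train(models[k], noisy_frames[idx], clean_frames[idx])   # hypothetical trainer

# 3) At test time, route each noisy frame to the model of its nearest cluster.
def enhance_frame(frame):
    k = kmeans.predict(frame[None, :])[0]
    return models[k](frame)
```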
The only difference between
those four different
experiments is that I
use different data to
create the clusters in
the first place.
For the second experiment,
I use noisy Cortana data
to create the cluster centers.
For the third one,
I use clean Cortana data.
And for the last one,
I use clean TIMIT data.
Now let's see the result.
So, as we can check,
using a mixture model indeed
helps boost the performance,
I mean, quite uniformly, if we
compare the last experiment,
which uses clean TIMIT data,
with the first one.
But also, again,
if I just use the noisy or
clean Cortana data to
create the mixture model,
I also see an improvement
over the baseline,
which means that using
a mixture model indeed helps
speech enhancement, but
the source of success is not
the clean TIMIT data;
it's due to the more complex,
more powerful mixture model.
Okay, so I think I finished
the two major parts of my talk;
here is a brief conclusion
of the whole talk.
So I would say the
convolutional-recurrent neural
network is
very good for
speech enhancement,
because the convolution and
the recurrence help to capture
the structure of our data,
along both the time domain and
also the frequency domain.
And our model on
the Cortana data set
achieves an improvement
of PESQ 0.6 on average.
And also, without fine-tuning
the following ASR system,
we improve the WER by 1%.
I would imagine we
can make it larger with
an even larger amount of data.
And for the second part, the
semi-supervised learning using
TIMIT data:
so far we don't see
a successful approach
to use the clean TIMIT data to
further boost the performance.
From the
experiments here,
I would say using a mixture/ensemble
model combined with
the convolutional-recurrent
neural network
will achieve even
further improvements.
That's it, the end of my talk.
>> [APPLAUSE]
>> Any questions?
>> Can I make a quick comment.
>> Yes.
>> I was surprised by
the auditory quality.
I've been on noise suppressors for
a long time,
doing simple things
without any machine learning,
just simple
filters and
spectral subtraction and
things like that.
You end up fighting
with musical tones and
other things like that, and
you didn't have any noticeable ones.
There are also the spectral
components that go in and
out and produce a typical
whooshing effect, and
you have very little of that,
which is also interesting.
And I would suggest that
the particular example that you
chose to show us in terms of
playing sounds and all that,
that is cool.
It would be nice because that's
still something that bugs my
head to this date since doing
some experiments 30 years ago,
which is, you play the noise,
and you play the enhanced one.
And in some cases,
including yours,
this particular clip
may not show, but
other clips may show that when
you play the enhanced one, it is
actually easier for a person
to understand what is said.
Which means that if
you have that feeling,
it means that to some extent,
the word error rate or
the fatigue rate for
the human brain got improved.
So how come the WER
in the speech recognition
didn't get improved?
>> Wasn't even right?
I mean, the strain on the data.
>> Right, but the fact that it
actually improves for a person,
it's a significant thing.
Because it means, usually,
I remember when I
was at PictureTel,
doing things like conference
analysis, there is this fatigue
factor when you have too much
reverberation or too much noise.
And if the clips show that you
feel a bit more comfortable
listening to it, then the
fatigue factor, after two hours,
let's say, of being in
a meeting where you have to try to
understand, maybe
the fatigue would go down.
So it might be interesting
that if you have
some clips that give
that impression,
then it is a little easier for
a person to understand.
>> Yeah.
>> It might be cool to actually,
with the HCI guys,
run a fatigue experiment.
Because if you show
a reduction in fatigue,
that is actually another metric
that is rarely discussed.
By remembering the video
conferencing business,
people would complain to us.
That's why, for a typical
example, is why they pay extra
money to do wideband
instead of telephone calls.
Cuz that's very easy to measure.
Fatigue on telephone
calls is very high.
Fatigue on wideband
is not that high.
And we would say,
it costs $3,000 more
to have everybody on wideband.
Because they knew [LAUGH] that
you'll get tired of a two-hour
meeting on telephone
quality and
it improves quite
a bit on wideband.
That is too obvious,
because, obviously,
it's a huge
difference in the metric.
But here, on the actual speech
quality, being classical and
saying, if it improves fatigue, it
could actually be something that
you could sell to Skype and say,
look, [LAUGH] besides Cortana,
if it's not a WER play,
but it's a fatigue play,
it's something that Skype
could be interested in, right?
>> Well,
these are very good comments.
>> Yeah, yeah.
>> Just to address what you said
in the beginning about using
classic suppression techniques,
and the musical tones and
stuff, right?
In my opinion and
my experience, after all this,
I think they have
their own benefits.
I think the major benefit
those techniques bring is
generalizability.
So these data-driven approaches,
they suffer quite heavily
when you put them in an unknown
situation, in noise
conditions which the model
has not seen, right?
So it becomes very difficult for
it to adapt, or
as well as to train it.
It takes a lot of time.
>> It's not just
like [INAUDIBLE].
>> Exactly, yeah, yeah.
>> And I also-
>> Yeah.
>> I guess one of the reasons why
the WER drops after speech enhancement
is because after
the enhancement, the distribution of
the signal is different from
the training of the ASR system.
>> ASR systems, yeah.
>> That is exactly how I
remember conversations with
Alex, who used WER, and
he would say exactly that.
>> Okay.
>> You messed up the PDFs, but
no matter what the machine
learning I'm doing now,
I'm just smashed.
>> Yes.
>> But
the good question would be if
you do speech recognition on
the noisy data, to some extent,
that machine learning
engine has learned that.
Now retrain the speech engine
with enough samples of your
enhanced thing,
>> Yeah.
>> And now, run through that.
>> Yeah.
>> Could you then
get an improvement?
>> I would expect,
I would expect.
>> The answer might be yes.
And if a yes,
then it's a good thing to do.
>> Yes.
>> We did this with Kinect, right?
>> Right.
And we ran it once against it, right?
>> Today, when you download the Kinect SDK,
there is a Kinect acoustical
model, which was the output
of the pipeline.
>> So-
>> Yeah.
>> But it was a little
bit more complicated.
So first, we trained the audio
pipeline using the desktop
speech recognizer,
which was trained on clean speech.
So literally,
the speech recognizer was
another quality measure,
the transcription between
the clean speech and
the noisy. While this was happening,
we ran against it, we retrained
the speech recognizer,
got the new acoustical model,
and we saw a
substantial drop in the word
error rate with
the new acoustical model.
So we tried it one more time,
but this time,
we didn't get an improvement.
>> So I wonder [INAUDIBLE] with
the new Cognitive Services,
if they allow you
to do that much work.
>> Retrain it.
>> Retrain the engine.
Because if so, that would
be a natural extension
in terms of experiments.
So just at least try [LAUGH] and
see what happens.
>> So if we can configure them and
retrain the service,
I think we should try that,
we could do that.
>> Yeah, how-
>> But look, if you do that,
then the real solution is match
this with the front end of
the speech recognizer and train
them both with the noisy speech.
>> Yeah.
>> Potentially, but at least you
will have an avenue to enhance
the final performance.
>> So I feel like one of
the biggest hurdles to adoption
of this technique is operating
in unseen, unknown conditions.
>> Yeah.
>> That's one of the biggest
challenges I think the whole
community should work on.
>> Yeah.
>> Because people have
worked on [CROSSTALK]
>> Yeah.
>> [LAUGH]
>> Especially,
just to specify, around
adaptation and transfer learning
to do this.
>> Yeah, so that's probably
where I think people
should put effort.
And this kind of, it's almost a
solved problem in my opinion, or
near to be solved.
>> [LAUGH]
>> Well, this is pretty cool.
I mean, I was coming to
the talk thinking, yeah,
one more paper, one [LAUGH] more
presentation on noise reduction
and it'll be full of artifacts.
I was surprised.
It sounded pretty good.
>> We have to more
excited about it.
>> The second from
the front ops.
>> All right.
>> This is one of the guys.
Remember this-
>> I need to come to
the office in front.
[LAUGH]
>> [LAUGH]
>> But they are not now,
aren't they?
>> No.
>> No.
[CROSSTALK].
>> September.
>> Okay, so it's [INAUDIBLE].
>> [INAUDIBLE]
>> Okay, thanks, thanks!
>> [INAUDIBLE]
>> So let's thank him.
