[MUSIC]
>> Then, I guess we can start?
Yes. So welcome, everyone, to
the final internship
presentation talk
of Yangyang (Raymond), who is here.
He's doing his PhD at CMU,
Carnegie Mellon University in
Pittsburgh, with
Professor Richard Stern.
He worked here for a few months with us
on real-time single-channel
speech enhancement
with recurrent neural networks.
With that, the stage is yours.
>> Okay. Thank you, Sebastian.
Thanks to everyone for
coming to my talk today.
I'm going to present the work I've been
doing for the past three months
in collaboration with Sebastian,
and the title is Real-time
Single-channel Speech Enhancement
with Recurrent Neural Networks.
Let's get started. So
throughout the talk,
I'll first be introducing
single-channel speech enhancement.
After formulating the problem, we will
go over not only the methods
based on deep learning but also
the classical signal
processing methods.
Then we'll move on to
our method based on
a recurrent neural network
and connect what we
learned from classical
signal processing
to our decisions in
building our network.
We'll be doing a thorough evaluation
on a bunch of objective
speech quality measures,
and on a large-scale dataset
collected with the help of
Raj, [inaudible] , and Harry.
Finally, we'll be concluding
the talk and reporting
some major findings.
Let's get started with
the introduction.
So what is single-channel
speech enhancement?
Simply put, single-channel speech
enhancement aims to reduce
noise in noisy speech while
retaining speech quality
to the best extent possible.
Our overall assumption is
that our noisy speech comes
from the addition of a clean speech
signal and a noise signal, and there is
no other assumed distortion
like non-linear distortion,
channel distortion, or reverberation.
The other general assumption we
make is that the noise attributes
typically change more slowly than
speech, and our goal is,
as I said before, to suppress noise
and to retain speech to the
best extent possible,
to improve human or
machine perception.
In this project, our goal
is for the end-users,
the human listeners who are going
to listen to the enhanced clips.
So that will be our focus.
Let's have an overview of the generic
speech enhancement pipeline.
On top, we have
the flow diagram of
a classical signal processing-based
speech enhancement system.
We start with our time-domain
waveform signal x of t,
and throughout the
talk, I'll be assigning
x to all the noisy signals.
That signal goes through a
short-time spectrum analysis,
typically the short-time
Fourier transform,
to get the short-time
spectral characteristics.
After that, we separate
the short-time spectral
features into phase and
into magnitude denoted by
these little blocks there.
One challenging aspect of
single-channel enhancement is that
the phase is typically
very hard to recover.
So that is out of the scope of
our talk today as well, and
we will be leaving it as it is,
using the noisy phase for reconstruction.
We do the majority of our
work in the magnitude domain.
You see there are some generic
modules to estimate noise
or estimate the gain from the
magnitude of the noisy spectrum.
After that, we send it
into an estimator,
called a gain estimator,
that basically applies
a gain function in the
frequency domain to
each frame of our noisy spectra.
We point-wise multiply that with
our noisy magnitude and
use the enhanced magnitude
to recover the clean speech.
That was the generic pipeline for
over 30 years, until deep learning
arrived on the scene.
The basic pipeline is
similar in the sense that we start
with our time-domain signal.
We do some feature extraction which
doesn't have to be spectral features
anymore, it can be anything.
But our end goal is still to estimate
this time-frequency gain function
denoted g with a hat there.
Then we point-wise multiply that with
the noisy magnitude to recover
the clean speech, hopefully.
As you can see here, everything
else in the middle becomes
more or less a black box,
thanks to the neural networks.
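The final reconstruction step shared by both pipelines can be sketched in a few lines; the 257-bin spectrum size and random values below are purely illustrative.

```python
import numpy as np

# Sketch of the shared reconstruction step: apply an estimated time-frequency
# gain to one noisy STFT frame and reconstruct with the noisy phase.
# X is a (freq,) complex spectrum of one noisy frame; G is a gain in [0, 1].
rng = np.random.default_rng(0)
X = rng.standard_normal(257) + 1j * rng.standard_normal(257)  # noisy spectrum
G = rng.uniform(0.0, 1.0, size=257)                           # estimated gain

magnitude = np.abs(X)
phase = np.angle(X)

# Point-wise multiply the gain with the noisy magnitude, keep the noisy phase.
S_hat = (G * magnitude) * np.exp(1j * phase)
```

Stacking such frames and inverting the STFT with overlap-add yields the enhanced waveform.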
Of course, with deep-learning we
have training data to leverage.
That's a huge advantage of
the modern machine learning
approach compared to
the classical approach.
Now, I would like to
focus in this talk
on several aspects we can improve
in this deep-learning box.
The first is feature extraction;
then the neural network itself,
the learning objective, and how
we actually train our system.
Our method will be
broken into four pieces.
Before I get into the actual method,
let me briefly go over
this short chart I picked.
As you can see, we have
six methods right here.
The first two are classical
signal processing-based methods.
The middle two are deep
learning-based but cannot
operate in real-time.
The last two rows are deep
learning-based and can
actually operate in real-time.
As you can see, I'm highlighting some
of the key components
that determine whether
or not the method can
process in real time.
For spectral subtraction,
the key part is estimating noise
with a moving-average filter, and
for the decision-directed
method you have
a recursive smoothing of the
instantaneous SNR measures.
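The recursive smoothing mentioned here is the textbook decision-directed rule from Ephraim and Malah; a minimal sketch, where the smoothing constant 0.98 and the function name are illustrative:

```python
import numpy as np

# Sketch of the decision-directed a priori SNR estimate: the current estimate
# is an exponential (recursive) smoothing between the previous enhanced
# amplitude and the instantaneous a posteriori SNR.
def decision_directed_snr(noisy_power, prev_enhanced_power, noise_power,
                          alpha=0.98):
    """All inputs are per-frequency power values for one frame."""
    gamma = noisy_power / noise_power            # a posteriori SNR
    inst = np.maximum(gamma - 1.0, 0.0)          # instantaneous estimate
    return alpha * prev_enhanced_power / noise_power + (1.0 - alpha) * inst

xi = decision_directed_snr(np.array([4.0]), np.array([1.0]), np.array([1.0]))
# 0.98 * 1.0 + 0.02 * max(4 - 1, 0) = 1.04
```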
Now, these measures can actually be
done if we drop the assumption
of the online processing.
Things like noise estimation
actually can be improved
massively if we incorporate
information from the future.
But that's how the scene was set up.
We want to keep that because
real-time processing is,
I think, our ultimate goal.
We have speech coming in
and we enhance it without
looking into what's
in the near future.
It's not that these methods
cannot use future information;
they just keep the assumption of
real-time online processing.
Although these two
deep-learning methods
here have done very well,
they use information from
the future, which
breaks that assumption,
and for that reason,
we are focusing our model on
an online, single-frame-in,
single-frame-out basis.
For that reason, the last two methods
are the qualified candidates
we want to compare to,
and the last one is not really
real-time because it is
trained on one-second waveforms.
All right. Let's jump
into the method part.
So as I said from the flow diagram,
we break our method into four parts.
The feature representation,
the learning machine,
the learning objectives, and
how we train the network.
Let's start looking at the
first thing, the feature.
We use the most standard
feature for a neural network,
that is the short-time
Fourier transform magnitude.
We also consider the
short-time log power spectra
with a negative 80 dB floor.
What you see on the left is actually
the log power spectra with a
linear color mapping,
displayed with the jet
colormap in MATLAB.
Let's see. Just for
the ease of viewing,
we have on the x-axis
time in seconds, and
on the y-axis,
frequency in kilohertz.
We have three
spectrograms stacked together.
The top one, we have the noisy.
I think that's with the air
conditioner noise at 20 dB.
In the middle, we have
the clean speech signal,
and on the bottom,
we have this weird-looking IRM, or
what you call the ideal ratio mask,
which is the ideal gain function.
I plot it in 2D on
a dB scale, because if I plotted
it between zero and one,
you would not be able
to see the contrast.
As I said before,
the output we're
trying to estimate is
this real-valued magnitude gain function
with range between
zero and one. Now, some
technical details about how we
construct this spectrogram.
We have a 16k Hz sampling
rate for our audio.
We used a 32-millisecond
analysis frame with
a Hamming window and
a 75 percent overlap.
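A minimal sketch of this feature extraction, assuming the settings just stated (16 kHz audio, 32 ms / 512-sample Hamming frames, 75 percent overlap, and a -80 dB floor); the function name and implementation details are our own:

```python
import numpy as np

# Hedged sketch of the log power spectra feature: 32 ms Hamming-windowed
# frames at 16 kHz (512 samples), 75 % overlap (128-sample hop), and a
# -80 dB floor on the log power.
def log_power_spectra(x, frame_len=512, hop=128, floor_db=-80.0):
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))        # (frames, 257) magnitudes
    logp = 10.0 * np.log10(np.maximum(mag ** 2, 1e-12))
    return np.maximum(logp, floor_db)                # apply the -80 dB floor

feats = log_power_spectra(np.random.randn(16000))    # one second of audio
```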
>> Hamming window or Hann window?
>> Hamming window.
>> The four percent [inaudible]
window from the zero?
>> Yeah. Not the Hann window;
the 0.54 plus 0.46 times the cosine.
>> That's Hamming.
>> Yeah.
>> Hamming window is from
zero to one to zero.
>> Okay. It's not Hann, its Hamming.
>> Okay.
>> Yeah. It's a raised cosine.
>> Have you tried different
windows, different [inaudible]?
>> I briefly tried a 20-millisecond
window with 50 percent overlap,
and the performance went
down for the network.
That was a month ago,
and I set it aside and never
really changed my original setup.
Yeah. But I think it'll work.
Well, the overlap might be a
problem, but a 20-millisecond window,
I think, will work.
>> So why do you use
the log power spectrum?
>> So the question I got is why
I use the log power spectrum.
Our perception correlates
with a log scale for audio.
As you can see, we can
visually see the contrast if
we map linearly the values
obtained on the log scale.
If I did it for just the magnitude,
the contrast would be so low that you
wouldn't even visually
see the difference.
>> But the magnitude
actually itself already
contains the information
for the power, right?
If you just square
the magnitude, it
just comes back
to the power, right?
>> Well, the log power is
just a non-linear compression
on the linear power.
>> Okay.
>> We'll actually do a
comparison by feature later.
So you'll see the result. Yeah.
>> Dynamic range of audio
signals is extremely high and
that's why we usually use
a logarithmic [inaudible].
So we assume that
neural networks will deal better
with slightly compressed data.
>> Yeah. We'll see.
>> Most people find.
>> In addition to the
features I mentioned earlier,
we are exploring
some different normalization
techniques on audio features.
The very first is
your standard global mean and
variance normalization by frequency.
The statistics are
accumulated over 80 hours
of randomly sampled speech
from our training set.
In addition to that, we also
explore online mean and
variance normalization.
As Sebastian mentioned,
the dynamic range of
speech is very high,
and it changes drastically over time.
One way to deal with it is to
smooth the spectrum
in time, and we apply
a three-second
exponential window, either
globally or on a per-frequency basis.
As you can see from the left here,
the top graph shows the
original noisy spectra.
The middle one is the spectra
after frequency-dependent
normalization, and the
bottom figure here is after
frequency-independent
normalization.
The absolute value or the
color is not important.
The contrast is more important.
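The online normalization described above can be sketched as a running exponential mean and variance per frequency bin; the forgetting factor below is derived from an assumed 8 ms frame hop and a 3 s window, and the exact recursion used in the talk may differ:

```python
import numpy as np

# Hedged sketch of online, frequency-dependent mean/variance normalization:
# per-bin running statistics updated with an exponential window.
def online_normalize(spectra, hop_s=0.008, window_s=3.0):
    alpha = np.exp(-hop_s / window_s)        # exponential forgetting factor
    mean = np.zeros(spectra.shape[1])
    var = np.ones(spectra.shape[1])
    out = np.empty_like(spectra)
    for t, frame in enumerate(spectra):
        mean = alpha * mean + (1.0 - alpha) * frame
        var = alpha * var + (1.0 - alpha) * (frame - mean) ** 2
        out[t] = (frame - mean) / np.sqrt(var + 1e-8)
    return out

normed = online_normalize(np.random.randn(100, 257))  # toy (frames, bins) input
```

A frequency-independent variant would simply share one mean and variance across all bins.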
Okay. All right.
So after the features we are getting
into probably the most important part
of our system which is
the learning machine.
The neural network itself.
The recurrent neural network is
the most natural choice for us
because, first and foremost,
it has a notion of time.
It outputs something
for this time instant
based on the input for
the current time and also
on the output from
the previous time steps,
which is similar to
the recursive filtering we
have in the classical approach
to speech enhancement.
So that's the basic
structure we are based on.
One well-known recent example
that uses RNNs for speech
enhancement is called RNNoise;
you can check out the paper with
the reference on the
last slide there.
The network has,
I would say, a pretty
complicated architecture;
I'm not going to get into that.
But there are two things
that caught our attention.
The first is the use of
gated recurrent units
which has proven to learn long-term
temporal patterns effectively.
The second is a dense layer
which really acts like
a long non-linear transformation
block to bring your feature
from a long-term sequence
into the gain function
at that time instant.
So we take their ideas, and
what we realized is that
it's important to have
these residual connections
somewhere in the network.
Residual connections
facilitate learning in deep networks.
There's a very famous
paper from a couple of years ago
in the field of computer vision,
on an image
classification task.
What the authors found is that,
by having this simple residual,
which means,
imagine you have
multiple layers and you
simply add the input of
an earlier layer
to the later layers of
the network, you facilitate
learning an extremely deep network.
I think tens of layers,
something like 20 layers.
That's in computer vision;
in our case here,
the depth of the network
actually corresponds to
the number of time frames.
I'm going to explain that later.
Because we are going to train
the network with a
very long sequence,
we believed that the residual
would help within the network,
and that's what we decided to do.
On the right,
you see a standard gated recurrent
cell, and what we did was
simply add this bypass
connection from the input
to where the cell aggregates all
the learned components and propagates
them into the next layer.
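The bypass connection on the GRU cell can be sketched as a single from-scratch step; this is not the speaker's implementation, and note the input and hidden sizes must match for the residual addition:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Sketch of one GRU step with the residual ("bypass") connection described
# in the talk: the layer input is added to the new hidden state before it
# propagates upward.
def residual_gru_step(x, h, W, U, b):
    """W, U: dicts of (d, d) weight matrices; b: dict of (d,) biases."""
    z = sigmoid(W["z"] @ x + U["z"] @ h + b["z"])        # update gate
    r = sigmoid(W["r"] @ x + U["r"] @ h + b["r"])        # reset gate
    h_tilde = np.tanh(W["h"] @ x + U["h"] @ (r * h) + b["h"])
    h_new = (1.0 - z) * h + z * h_tilde                  # standard GRU update
    return h_new + x                                     # residual bypass

d = 4
rng = np.random.default_rng(1)
W = {k: rng.standard_normal((d, d)) * 0.1 for k in "zrh"}
U = {k: rng.standard_normal((d, d)) * 0.1 for k in "zrh"}
b = {k: np.zeros(d) for k in "zrh"}
out = residual_gru_step(np.ones(d), np.zeros(d), W, U, b)
```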
After we did that, we did
some literature research and
found out that the
same idea has already been
applied to different tasks.
There's sequence classification,
and probably the most similar one
is the Chen 2017 paper on
feature compensation.
What they did is
estimate this mask for
a Mel-spectrogram for
speech recognition.
It's for enhancing the speech
features as well, but it's
not used to reconstruct speech.
So we believed that
this block would do well.
But we don't stop there;
we still need an entire
network that goes
from our input features to
the output gain function.
What we did is simply stack a few,
in our case just three, GRU layers
with our residual connections.
If you zoom in on each block,
it will look like this,
except for the last layer, where
we don't add the residual.
The justification is that the
input features come
from audio, which has
a very high dynamic range,
while the output of the [inaudible] is
already compressed in the last layer
before it gets transformed
by a fully-connected layer.
We don't want the input's
dynamic range to mess up
what's being learned inside.
So we don't have that there.
Everything else has the
residual and in the end we have
a fully connected layer
with a sigmoid function.
The outputs are again
between zero and one.
That is our network architecture.
Now let's move on to the actual
learning objective, which
is probably equally as
important as the network.
We adopt the well-known
mean squared error.
Okay, so I have an inconsistency in
the notation here: X here is
the clean speech short-time
Fourier transform magnitude,
Y is our noisy speech, and we're
applying a gain function
to the noisy speech.
We simply take the point-wise
squared error and average it
across all time and frequency.
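The loss just described, written out directly; X, Y, and G follow the talk's notation, and the toy arrays are illustrative:

```python
import numpy as np

# Mean squared error between the clean magnitude X and the gained noisy
# magnitude G * Y, averaged over all time frames and frequency bins.
def mse_loss(X, Y, G):
    return np.mean((X - G * Y) ** 2)

X = np.full((10, 257), 1.0)   # toy clean magnitudes
Y = np.full((10, 257), 2.0)   # toy noisy magnitudes
G = np.full((10, 257), 0.5)   # a gain of 0.5 reconstructs X exactly here
```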
>> So this is the mean squared
error of the difference
between the clean and estimated signals.
>> Yes.
I would like to bring
some context about minimum
mean squared error.
There's seminal work by
Ephraim and Malah in 1984.
They posed the problem by assuming
the complex STFTs of speech and noise
have Gaussian distributions
and are uncorrelated,
and they solved for
the optimal solution in the minimum
mean squared error sense.
For the deep learning-based approach,
we actually don't have
any assumptions about
the distributions of anything,
and we simply learn
by stochastic gradient descent,
and hopefully we get to a point
where the error is low enough for
training and low enough for test.
The mean squared error has
stable convergence because if
you take the gradient of a square,
you have a linear gradient
everywhere.
>> Raymond?
>> Yeah.
>> The short-time
frequency-domain formulation is
Ephraim and Malah in 1984, but
actually, in the time domain,
it's Norbert Wiener in 1947.
>> That's right.
>> [inaudible]
>> That's right. Yeah, I was
going to emphasize the Gaussian.
>> Yes.
>> Yeah.
>> But Wiener is actually based on
mean squared error on the
time-domain signal, yeah, it's true.
With that observation, we can
rewrite this mean squared error
in statistical form, with an
expected value instead
of an actual average.
If you just rewrite it
a little bit and
ignore the cross term there,
we end up with two terms.
That is, by the way, a very...
>> Very cold.
>> Very coarse assumption
that maybe doesn't hold.
But our goal is to
separate speech distortion from
noise suppression, and by
ignoring the cross term,
what we end up with is actually
two mean squared errors.
The S here is the clean signal,
so the first is the mean squared
error between the clean signal and
the clean signal itself
multiplied by the gain function,
and the second is
the mean squared error of
just the noise multiplied by the gain.
So because we are not solving
for any optimal solution in the
statistical sense, and also we want
to balance the speech distortion
and noise suppression terms,
we first did this approximation
and then came up
with this new loss function
that has two separate terms.
The first one is on
speech distortion, and the second
term is on noise suppression.
So the way to interpret this is:
let's say your enhancement system
does nothing, which means
it just simply passes
everything through; then your
speech distortion is
zero, but you have all the
error coming from the noise.
With an enhancement
system that suppresses everything,
the gain function is zero,
you get no error from
the noise-suppression term, and you
get all the error from the speech.
>> So technically, the
first term wants to keep
the gain as close to one as possible,
and the second term wants
to make the gain as low as
possible to suppress more noise.
>> Yes.
>> If you want to balance.
>> Yeah.
>> You do so with Alpha.
>> Yeah.
>> Okay.
>> For the speech term, we only
compute it over the speech-active regions.
So we apply a simple energy-based
voice activity detector
on the speech here, and
the detector is simply a
thresholding on the energy
accumulated from 300 Hertz
to 5,000 Hertz, which is
typically where speech happens.
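A hedged sketch of the two-term loss with an energy-based VAD gating the speech-distortion term; the band edges (roughly 300-5,000 Hz for a 512-point FFT at 16 kHz), the threshold, and alpha are illustrative values, not the talk's exact settings:

```python
import numpy as np

# Two-term loss: a speech-distortion term computed only on frames a crude
# energy VAD marks as speech-active, plus a noise-suppression term,
# balanced by alpha. S and N are clean-speech and noise magnitudes,
# G is the estimated gain.
def weighted_loss(S, N, G, alpha=0.5, vad_threshold=1.0, band=slice(10, 161)):
    band_energy = np.sum(S[:, band] ** 2, axis=1)   # per-frame band energy
    active = band_energy > vad_threshold            # crude energy-based VAD
    distortion = np.mean((S[active] - G[active] * S[active]) ** 2)
    suppression = np.mean((G * N) ** 2)
    return alpha * distortion + (1.0 - alpha) * suppression

rng = np.random.default_rng(2)
S = np.abs(rng.standard_normal((20, 257)))
N = np.abs(rng.standard_normal((20, 257)))
loss_pass = weighted_loss(S, N, np.ones_like(S))   # G=1: no speech distortion
loss_mute = weighted_loss(S, N, np.zeros_like(S))  # G=0: no residual noise
```

With G everywhere one, only the noise term contributes; with G everywhere zero, only the speech-distortion term does, matching the interpretation above.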
>> Can we get that on clean speech?
>> Yes. Yeah, that's a very
crude energy-based VAD;
if we ran it on the noisy signal,
it would probably fail.
We don't stop there.
We also have this observation from
the classical signal processing
point of view: when we
have a noisy signal that's almost
clean, we don't want to destroy
any speech content in there,
so the result is we pass
almost all the noisy speech unchanged
to retain the speech quality.
When there's so much noise
in the speech that we
cannot even get a sense
of where the speech is, we just apply
a very heavy suppression
on the entire thing.
So that basically says,
when the SNR is approaching
infinity, we want
very little speech distortion, and
when the SNR is approaching zero,
we want very aggressive
suppression on the noisy speech.
Motivated by this observation,
we have another loss function
built on top of the previous one,
with these SNR terms multiplied
into each part of the
loss function there.
I have to mention here that
the original intention was to
view this as a whole term.
So this is the weighting
for speech, and one
minus it multiplies
the parenthesis here;
that's the original intention,
but that's not how our
implementation did it.
So as not to confuse
the audience: the results
I'm showing here
come out of this implementation,
and one easy piece of future work is
to try the corrected weighting,
but we keep it as it is here.
>> Well, this just scales both
terms with the same number.
>> But this is per example.
So imagine you have a batch of audio.
>> Okay. Then your
global cost function
will be weighted the same, okay.
>> Yeah, but still, it would
be more correct if you do
it with weighting of
the alpha by the SNR.
>> Sorry, why is there
no sigma in that square,
the lower-right square?
>> That was a typo, sorry. That
should be the square. Thank you.
All right.
So in the classical
decision-directed
approach from Ephraim and Malah,
you have hidden states:
the a priori and a posteriori SNRs are
your hidden states, in
deep learning language.
The hidden states from the
previous estimate affect
the current one by an exponential
smoothing process.
We have this analogy in
the RNN-based approach.
But what we have is almost
a black box, with hidden states
where we don't know the meaning
of what they carry.
But we know that they are capable
of learning very long
temporal sequences.
They learn through
backpropagation
through time. We want to
actually study the effect of
the length of the sequence
we pass in. This is
just a simple pseudo recurrent
neural network I have here.
So this is your hidden state from
the previous time frame, and the
hidden state from the current one
is the output here; your input
is x of t, and your
output is some y of t here.
Let's just say your hidden
state at t simply equals
your output, and your output is simply
a function of your input plus
your previous hidden state.
Then, if we take the partial
derivative of the output
with respect to the learning
parameters of the network,
we see it's a function of
your current instantaneous
gradients multiplied by
something from the previous
time frames, and this T here.
Here, I have from T all the way
back to zero, but we can control
this length and see how
it affects speech quality
of the enhanced signal.
So we are doing this
comparison as well.
Okay. That's the end of our method.
Now, let's move on to evaluation.
We have 84 hours of training data.
The clean speech comes from the
Edinburgh 56-speaker corpus;
the noise comes from 14 noise types
from the DEMAND database and Freesound.
For test, we have 18
hours of test clips.
The clean speech comes from the Graz
University 20-speaker corpus,
so there's no overlap
with training at all.
For noise, we are picking
nine challenging classes from
the 14 in training,
but we have different
signals for test.
Those are very
challenging noise types.
For example, we have the competing
talker in the neighbor class, and we
have transient noise
such as munching,
doors shutting,
and airport announcements.
>> Noise clips in
the test data are never
presented in the training set?
>> No. Right? No.
>> Okay.
>> Yeah. We have five
different SNRs,
from 0 dB to 40 dB with
a 10 dB step, and all clips
are sampled at 16 kilohertz.
This is just a close-up
look at the data we have.
On top, we have clean speech.
On the bottom, we have the noise.
This is our waveform plotted in
dB, and you see it is the same noise
repeated five times here,
from 0 dB to 40 dB there.
We have the speech normalized
to the same level,
but it's the same speech
repeated five times.
During training, what we did is
augment the data a little bit by
randomly drawing a segment of waveform
from any clean speech file.
From the noise files, we do
the same, and we mix them.
So the SNRs don't change; it's
still the five discrete SNRs,
but now you have different speech
mixed with different noise.
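The mixing step can be sketched by scaling the noise so the mixture hits the target SNR; the helper name is our own:

```python
import numpy as np

# Sketch of the augmentation: draw a speech and a noise segment, then scale
# the noise so the mixture lands exactly at one of the discrete target SNRs.
def mix_at_snr(speech, noise, snr_db):
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

rng = np.random.default_rng(3)
speech = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
mix = mix_at_snr(speech, noise, snr_db=20.0)
```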
>> Do we have these
[inaudible] in one file?
The drastic change of the noise,
can we have a segment
like that?
>> Thanks.
>> Yeah.
>> [inaudible] and
conditional speech.
>> It's probably a good thing
for the network too. Yeah.
>> It makes your data
set a bit more difficult.
>> Yeah. It might be even better
to mix with different
SNRs on the fly,
so we just draw one noise
and draw one speech and mix at
a randomly drawn SNR level,
but we didn't try that.
So that's our data, and we have
quite a few systems to compare.
We start with the noisy, unprocessed
signal, and we have the
statistical-based method.
This is the signal processing-based
method developed here
at MSR, without training
data of course.
We have our proposed method
here with this setup,
and we have a recurrent
neural network which is
simply our network with the
residual connections removed,
because we want to study how effective
the residual connection actually is.
Everything else stays the same.
For RNNoise, we use the original
code published by Valin in 2018,
because they have a package to
do all the training and testing,
and we couldn't augment the data,
so we just keep the data as it is.
In our experience, the data
augmentation of the [inaudible] was
worth about 0.1 in test score, and
they don't have this.
Keep in mind that this number will be
lower than what it
potentially could be.
For the simplified RNNoise here,
what we did is simply take
their network architecture.
Theirs enhances
a very crude energy contour,
I believe it's 22 bands,
but we have the full band, 257 bins.
So what we did was take
their architecture,
scale up the feature dimensions to
match the full band, and scale up
all the other dimensions
within the network
to accommodate this
scaling difference.
We don't use data.
They train with
a voice activity detector as well, and
we didn't use that there
because we don't have labels.
>> [inaudible] or as output?
>> So what they did is
they have it as output and
they train it with labels. Yeah, yeah.
>> So do we have this in the
proposed architecture?
>> The VAD? No. For the
proposed method, it is kind of
built into the learning objective
because of the speech
distortion term. Yeah.
>> It is in the RNNoise,
the original RNNoise.
But we found [inaudible].
We randomized that, and it
didn't change anything.
>> [inaudible].
>> Sorry Donald. Yeah, your mic.
I can't hear you. [inaudible]
>> Let me know the question.
Finally, we have Oracle information
plus the Wiener filter rule,
which marks, theoretically,
the best we can do.
So we have seven systems to compare.
In terms of evaluation metrics,
we have four classical speech quality
or intelligibility measures.
They are the scale-invariant
signal-to-distortion ratio,
which is a really
robust version of SNR;
the cepstral distance, which
is a distance metric
in the cepstral domain,
where, supposedly,
you have flattened the channel
and speech dimensions;
the third, short-time objective
intelligibility, which is
in terms of percentage;
and finally, the perceptual
evaluation of speech quality, or
PESQ, which predicts a mean opinion
score of speech quality.
Except for cepstral distance,
everything else is better
with a higher value.
Cepstral distance is
better with a lower value.
We also incorporate
this new DNN-based mean opinion
score prediction called
AudioMOS. It is trained on the MOS
scored by real users, and
it has a 0.89 Pearson
correlation coefficient
on their test data.
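The first metric, scale-invariant SDR, can be written out directly as a sketch; the projection onto the reference is what makes the score invariant to overall gain:

```python
import numpy as np

# Scale-invariant signal-to-distortion ratio: project the estimate onto the
# reference, so rescaling the estimate does not change the score.
def si_sdr(estimate, reference):
    s_target = (np.dot(estimate, reference)
                / np.dot(reference, reference)) * reference
    e = estimate - s_target                     # everything not explained by s
    return 10.0 * np.log10(np.sum(s_target ** 2) / np.sum(e ** 2))

ref = np.sin(np.linspace(0, 10, 1000))          # toy reference signal
est = ref + 0.1 * np.cos(np.linspace(0, 10, 1000))  # toy noisy estimate
score = si_sdr(est, ref)
```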
All right. Ready for the results?
>> [inaudible] one more thing.
>> Yes. Some experiments
learning on average.
>> I had a question on the
RNNoise when you began playing.
>> Yes.
>> So in the original [inaudible],
the bands were defined
assuming [inaudible].
Did you change that, or
is it still using that?
>> Everything, I think, is at a
16-kilohertz sampling rate.
>> Yeah, but then you'd have
to change the code, because their
sampling frequency is
48 kilohertz, and the critical bands
are spaced assuming 48 kilohertz.
So you have to write a new resolution
for the frequency range.
>> Okay.
>> Because if that
didn't happen then it's
probably not a valid comparison.
>> So this is the original
RNNoise; it is not
the one that you modified.
>> Yeah.
>> Then it's not. I
think it cannot take
16-kilohertz input, because
the frequency bands are
not meant for that.
>> Okay.
>> There's a better version of it.
>> The frequency bands are
based on the 0-24 kilohertz range.
>> So the one that we
used in our speech here,
that one was a modified RNNoise.
It performed way better
than the original one.
>> Yeah. I actually didn't
remember the sampling
rate, because I did
this baseline almost
three months ago. Yeah.
>> Basically, if you used
the RNNoise on GitHub
at 16 kilohertz on speech, I
would not include it in [inaudible].
>> Okay.
>> We do have [inaudible] for
this version as to [inaudible].
>> Okay.
>> Yeah. We have I guess
suggested results.
>> Well, the simplified
RNNoise with the full-band
enhancement can still be in
the talk, because we changed
the architecture.
>> Because you are not
using any [inaudible]?
>> Yes.
All right. So in terms of results,
let's first look at the
best from each category.
"Best" is surrounded by
quotation marks because I picked
the best based on
the best PESQ score.
I'll show later that none of
those objective measures
is actually optimal,
but let's start with
this comparison first.
As you can see here,
the first thing to
notice is that RNNoise has
a tremendously smaller number
of parameters because of
the lower dimensionality of
the crude energy contour.
Our system has 1.26 million
trainable parameters,
but to put that number into perspective,
it can actually enhance
one second of audio
in 39.6 milliseconds on a single GPU
on the GCR machine I
used, just using Python,
with a CPU
at 2.6 gigahertz.
So it's well within the
real-time processing constraint.
In terms of objective measures,
we find that our method
outperforms the other systems
in all categories.
I'll explain this in the
next few slides here.
But again, I chose this one just
because of the absolute
best PESQ score
obtained on the test data,
but this is for human listeners,
so we want to listen to how
it actually sounds.
So let's start with the noisy.
>> She had jumped away
from his shy touch like
a cat confronted by a sidewinder.
He had left her inviolate, thinking
familiarity would gentle her in time.
>> Can you hear that?
>> It should if it
goes to the speakers.
>> Wow I could hear it.
>> Yeah I can hear it too.
>> Okay.
>> So that was babble noise at
20 dB, and this next one is based on...
>> She had jumped away
from his shy touch like
a cat confronted by a sidewinder.
He had left her inviolate, thinking
familiarity would gentle her in time.
>> That's based on classical
signal processing.
I'm going to skip a few here,
and let's listen to the
full-band RNNoise.
>> In a way, he couldn't blame her.
She had jumped away
from his shy touch like
a cat confronted by a sidewinder.
He had left her inviolate,
thinking familiarity
would gentle her in time.
>> Our proposed method.
>> In a way, he couldn't blame her.
She had jumped away
from his shy touch like
a cat confronted by a sidewinder.
He had left her inviolate,
thinking familiarity
would gentle her in time.
>> Okay. We did some comparison
on feature normalization, and we felt
that normalization in general helps,
but maybe not as much as we
expected it to.
Yeah, I'm going to skip
through this part,
and on the effect of sequence
lengths, we found that five
seconds is good. Yeah.
So the first four bars
are based on the short-time
spectral amplitude.
The next four bars are
based on log spectra.
Here we have the original spectra,
after global normalization,
after online frequency-dependent
normalization,
and after online
frequency-independent normalization,
and the same for log spectra.
We use the exact same network
architecture; the only difference is the feature.
>> Global normalization, [inaudible]
the green log spectrum,
global normalization
is actually the best?
>> Yes. This is based on
mean squared error only,
not the speech-distortion
weighted loss.
>> Raymond, you mentioned that
the clean speech is roughly
at the same level.
>> Yeah, I've either-
>> But in reality, this
is valid if you have
a [inaudible] microphone or
you are at roughly the same
distance from the microphone.
But if you are using
a local microphone, you can be
half a meter or five meters away;
you'll have a 20 dB
difference in the voice level.
>> Yes.
>> That dynamic range is where
the normalization would
tremendously help.
>> Yeah.
>> Not here. So this
is not of much value;
any conclusion here is limited.
>> That's true.
>> Okay.
>> [inaudible]
>> Speech is always
at the same level.
>> Yeah, that's true.
>> You need to augment the data,
the whole input, with
different levels.
You have a knob on
your microphone, right?
You can pull it down
20 dB or crank it up.
You don't know at which
level you'll get the audio,
so you need to augment it.
>> That is the dynamic range
with which normalization battles.
>> That's true. That's true.
That's true, yes. Good point.
We also study the effect
of sequence lengths when
we train the system.
So basically for every batch we
are feeding one minute of speech,
but it could be 61 seconds segments
or it could be a two
30 seconds segments.
What we found is that I had to stop
this after 53 epochs because it was
taking almost a week
and is very slow.
But what we saw, and
the last row here is
not a fair comparison, but what
we found is that five seconds
is actually a good number,
definitely better than
one second per segment.
So with this length,
we were surprised that the
recurrent units are actually
able to learn from such
a long sequence. We
have an eight-millisecond overlap,
so that means over 600 frames,
and it's still able to
learn and learn well.
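The segmentation arithmetic above can be sketched as follows (a hypothetical helper, assuming a 16 kHz sample rate and treating the 8 ms figure as the frame hop):

```python
import numpy as np

SR = 16_000   # assumed sample rate
HOP_MS = 8    # 8 ms frame step mentioned in the talk

def split_minute(waveform, seg_seconds):
    """Split one minute of audio into equal fixed-length training
    segments, e.g. sixty 1-second segments or two 30-second segments."""
    seg_len = seg_seconds * SR
    n_segs = len(waveform) // seg_len
    return waveform[: n_segs * seg_len].reshape(n_segs, seg_len)

minute = np.zeros(60 * SR)
print(split_minute(minute, 1).shape)   # (60, 16000)
print(split_minute(minute, 5).shape)   # (12, 80000)
# A 5-second segment at an 8 ms step spans 5000 / 8 = 625 frames,
# consistent with the "over 600 frames" figure from the talk.
print(5_000 // HOP_MS)                 # 625
```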
>> So sequence length is what?
The training examples?
>> Yeah, the number of frames
in the training example.
We randomly sample a waveform
and transform it, yeah.
Now the more interesting results
are from the two loss
functions we had.
Remember we have this
speech-distortion-weighted
loss with this
term alpha there,
and we're doing a sweep of alpha
between zero and one to see
where the optimal value is by
different objective measures.
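One plausible form of such a trade-off loss can be sketched as follows (an illustration on magnitude spectra; the exact formulation used in the talk may differ):

```python
import numpy as np

def sd_weighted_loss(mask, speech_mag, noise_mag, alpha):
    """Speech-distortion-weighted loss sketch: alpha trades off speech
    distortion against residual noise.
      alpha near 1: penalize speech distortion (gentle suppression)
      alpha near 0: penalize residual noise (aggressive suppression)"""
    speech_distortion = np.mean((mask * speech_mag - speech_mag) ** 2)
    residual_noise = np.mean((mask * noise_mag) ** 2)
    return alpha * speech_distortion + (1 - alpha) * residual_noise

rng = np.random.default_rng(0)
s = np.abs(rng.normal(size=257))  # toy clean-speech magnitudes
n = np.abs(rng.normal(size=257))  # toy noise magnitudes
# An all-pass mask has zero speech distortion but full residual noise;
# an all-zero mask is the opposite extreme.
assert sd_weighted_loss(np.ones(257), s, n, alpha=1.0) == 0.0
assert sd_weighted_loss(np.zeros(257), s, n, alpha=0.0) == 0.0
```

Sweeping alpha between 0 and 1, as in the experiments, then traces out the curve from heavy suppression to almost no suppression.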
What we found is that,
first of all, you see
every measure has a nice shape:
it starts somewhere bad,
goes to somewhere optimal,
and goes bad again;
this happens to every measure.
Second, they don't
agree with each other.
For the speech distortion weighting,
we found that audioMOS
picks this point, which actually
has very high speech distortion.
Let me play an example.
>> Those answers will
be straightforward
if you think them
through carefully first.
Drop five forms in the box.
>> It's very aggressive [inaudible] .
>> Yeah. You can
almost hear no noise,
but the speech distortion
is quite large.
>> This maximizes PESQ?
>> No, this maximizes audioMOS,
the DNN-based prediction.
>> [inaudible]
>> Let's listen to what
PESQ thinks the best is.
>> Those answers will
be straightforward
if you think them
through carefully first.
Drop five forms in the
box before you go out.
If people were more generous,
there would be no need for work then.
>> It's right here. So the
speech quality is better,
but you hear this residual
noise that occurs
when the speech happens
at the same time.
I'm not going to play the other
two because they are worse;
the other two measures roughly
agree with each other.
>> Can I have the 0.65?
>> Okay, sure.
>> Those answers will
be straightforward
if you think them
through carefully first.
Drop five forms in the
box before you go out.
If people were more generous,
there would be no need to work then.
>> So there's much more
noise than in the previous one.
>> There's not much
improvement in the quality.
Can you play the noisy one?
>> No, it is. The speech
is not [inaudible].
>> Those answers will
be straightforward
if you think them
through carefully first.
Drop five forms in the
box before you go out.
If people were more generous,
there would be no need to work then.
>> If you want to compare
PESQ and audioMOS
because they're both
predicting MOS scores,
you see this, they're agreeing more.
So at this end,
there's almost no suppression,
so they agree more; when
you get over here, where there is
heavy speech distortion,
they start to disagree a lot.
>> audioMOS wasn't trained
on artifacts like that.
>> Yeah. So that's
something to work on.
We also have this
SNR-weighted objective here.
Again, a similar trend, and they
don't agree with each other.
Again, audioMOS prefers
very heavy suppression.
>> We welcome many new
students each year.
George is paranoid about
the future guest storage.
The carpet cleaner should include-
>> No, that was the speaker.
>> Please shorten the
script for choice.
>> PESQ and SDR here.
>> We welcome many new
students each year.
George is paranoid about
the future gas shortage.
The carpet cleaner should
include our oriental growth.
Please shorten the script for choice.
>> If you just go-
>> We welcome many new
students each year.
George is paranoid about
the future gas shortage.
The carpet cleaner should
include our oriental growth.
Please shorten the script for choice.
>> So this is only one example.
It's hard to say whether it works
equally well on the others.
But my preference is around
that 0.2 to 0.3 range.
All right. That's getting to the end.
We have some major findings
from all the experiments here.
The first and foremost is that
residual connections
really, really help.
If you just compare the plain RNN
with our proposed method,
the only difference is the
residual connection, and
this makes a vast difference.
We were surprised to find
that the recurrent units,
in this case GRUs but
probably LSTMs as well,
are able to encode
extremely long patterns in a
high-dimensional space. Yes.
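The residual-connection idea can be illustrated with a minimal numpy sketch (a stand-in dense layer, not the actual GRU network from the talk):

```python
import numpy as np

def layer(x, w):
    """A stand-in for one recurrent/dense layer: tanh of a linear map."""
    return np.tanh(x @ w)

def stack(x, weights, residual):
    """Stack layers with or without residual (skip) connections.
    With residuals, each layer only needs to learn a correction to
    its input, which generally makes deep stacks easier to train."""
    h = x
    for w in weights:
        out = layer(h, w)
        h = h + out if residual else out
    return h

rng = np.random.default_rng(1)
x = rng.normal(size=(2, 16))
ws = [rng.normal(size=(16, 16)) * 0.1 for _ in range(4)]
res = stack(x, ws, residual=True)     # output keeps a direct path to x
plain = stack(x, ws, residual=False)  # output must be rebuilt per layer
```

Note that the skip path requires each layer's output dimension to match its input dimension, which constrains the hidden sizes.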
>> I want to make one comment on that.
You compared one and five,
but you didn't compare one to,
say, two-second segments.
So it's possible that it
learns more than one second
but doesn't actually learn
all the way out to five.
>> Maybe.
>> Anyway.
>> Yeah. But still, for one second,
it's already 125 frames.
So it's able to really learn
very long-term temporal patterns.
There was a point when I was looking at
the enhanced gain function in dB and
saw this constant suppression
at six kilohertz.
I was wondering what was
going on until I
saw this example in training.
We have vacuum noise,
and there's just a tone
around six kilohertz,
where speech doesn't
usually occur, so it turns
out the network just learns
to suppress that
frequency very heavily.
For stationary patterns like that,
there might be room to incorporate
classical signal processing.
>> It's a preprocessor.
>> Yeah. But whether or
not you can detect that
tone, that's another story.
>> Jamie, come on,
use the suppressors.
>> Not a very good idea
>> Yeah.
>> So you get suppression of
that tone even when it's not there.
>> Yes, at that frequency.
>> It means there's not enough
variety in the noise data.
Just one vacuum cleaner
[inaudible] vacuum cleaners.
>> Yeah, data augmentation
will definitely help there.
Another major finding
is that by having
the SNR-weighted or the
speech-distortion-weighted objectives,
we are able to enhance the speech
without broadband masks.
The problem before was that
it was almost acting like a VAD:
it suppressed everything when
there was no speech present,
and when there was speech,
it just opened up the frequencies
and let everything through.
But with the new weighting
function we have,
it's much more selective in
terms of which frequencies
the enhancer suppresses,
and we also confirmed
that by listening.
So in conclusion, we propose
a DNN-based online speech
enhancement system
with a very compact neural network.
The storage complexity is
a linear function of the
feature dimension squared,
so by reducing that number,
you can have even smaller networks.
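The quadratic dependence on feature dimension can be sketched with a standard GRU parameter count (an illustration, not the exact architecture or sizes from the talk):

```python
def gru_params(input_dim, hidden_dim):
    """Approximate parameter count of one GRU layer: three gates, each
    with an input weight matrix, a recurrent weight matrix, and a bias."""
    return 3 * (input_dim * hidden_dim + hidden_dim * hidden_dim + hidden_dim)

# If the hidden size tracks the feature dimension d, storage grows ~ d^2:
print(gru_params(257, 257))  # 397065
print(gru_params(128, 128))  # 98688
```

Halving the feature dimension therefore cuts the per-layer storage by roughly a factor of four.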
We introduced two novel
learning objectives
motivated by balancing speech
distortion and noise suppression.
Thanks to Ross, we found a couple of
days ago that one of
the weighting functions,
the first one, appeared in a paper
published like seven days ago.
The other one is still new.
So let's hope that by the time
we write a paper we-
>> [inaudible]
>> Yeah.
We study the impact of
multiple factors associated
with training a neural
network for speech quality,
and we explore feature normalization.
But as Evan said, we need
more variation in the
data to confirm that.
We study the effect
of sequence length
and the two objective weightings,
and we compare competitive
signal-processing based and
deep learning-based
online systems in terms
of objective speech
quality measures,
and ours performed better
than the systems
mentioned in this talk.
The future directions of this work
involve studying the speech quality
improvement as a function of SNR.
The numbers reported before
were means over everything,
but by analyzing different SNRs,
we might find different
patterns, because
our objective is a function of SNR.
We will explore more
learning objectives
to replace the mean squared error.
There are some measures we tried
before that we thought worked well
in the classical-processing sense,
but it turned out they
didn't work so well in
the neural-network setting,
probably due to issues with training.
That's another path to explore.
We can also explore reducing
the dimensionality
of the features to reduce
the model size and improve
the computation speed.
That's the end of my talk.
I want to thank all of
you for coming here,
and I want to thank in
particular my mentor, Sebastian.
It's been a real pleasure
working with you.
I want to thank Hannis for providing
multiple tips on using Filly.
Ninety percent of the
experiments shown
today wouldn't be
possible without Filly.
So thank you for that
and thank you for
all the other mentors in the
audio and acoustics group;
Evan, David, and Dmitri.
I would like to thank Ross,
Chandon, and Harry from
Skype, who's not here.
>> He's online.
>> Sorry.
>> Harry is online.
>> Oh, hi Harry. I
would like to thank
them for preparing the training and
test data; they gave the data to
me on my first day here,
which made it much easier
for me to get to work.
Thank you for organizing
the event, that was great.
Thanks for all the suggestions
through our weekly meetings.
Finally, thanks to all the interns;
I'll be missing your
company when I leave.
I'm staying far. Thank
you very much. All right.
>> It's time for questions.
>> I have one question, really.
Before you started out,
it looks like you decided to
adopt the same architecture as
the conventional approaches, in
terms of masking the magnitude
in the spectral domain.
Did you look at time-domain
processing, where you operate
on the waveform directly?
>> I think we decided to take
this masking approach in
the first week. We are
aware of time-domain enhancement,
but we think that's a completely
different research problem.
Yeah, that's certainly a
direction to look into,
but we'd probably need
a different setup
than what we have today to do it.
>> One more question.
Otherwise if there's not,
let's thank Raymond, again.
>> Thank you.
