Audio is a huge field, and it's arguably the field that really started the interest in deep learning, so we're just going to scratch the surface of audio in this video. What I really want to show you is that we can take the exact same techniques that we applied to text and image classification and apply them to audio. Now, it's not totally obvious how you do that, right? Audio comes in a really different format than an image or text: typically we represent it as a wave, or maybe two waves if you have stereo sound. So how do we actually get it into a format where we can process it, and what do we do with it?
Audio files tend to be big, and it tends to be complicated to ingest and handle them, so I'm going to do a very, very small classification example. The idea is we want to classify people saying different specific words, and we're going to see how well we can do that with some really simple Keras techniques.
So here's the task: we want to classify sounds, the sounds are people speaking, and we classify them by what the person is saying. I found WAV files online of various people saying the words "bed", "happy", and "cat", and actually there were a lot more sets of WAV files there, so you can follow a link we'll put in the comments to download more if you want to classify different words. What we're going to do is take those WAV files, do some transformations on them, and then run various types of neural nets to see how well they classify this data.
So, first of all, we do the standard sort of importing of libraries like Keras, and also a preprocessing library that I mostly copied from another audio-processing git project; it does things like transform the WAV files into spectrograms.
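In code, that setup looks roughly like this; the module name preprocess and what it exports are assumptions based on the description, not necessarily the exact names in that project:

```python
# Sketch of the setup; `preprocess` is the author's helper script
# (hypothetical name) that wraps the WAV-to-spectrogram work.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Flatten

import preprocess  # hypothetical: WAV -> MFCC/spectrogram helpers
```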
The next thing that happens is we set the number of buckets in our spectrogram and the length of time that we want to operate over, and then we use a function from this preprocessing library to transform these WAV files into something that looks more like a sonic spectrogram.
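If you're curious what that transformation does under the hood, here's a minimal sketch using librosa; the function name, the specific values of buckets and max_len, and the zero-padding strategy are all assumptions, since the actual preprocessing library isn't shown in full here:

```python
import librosa
import numpy as np

buckets = 20   # number of MFCC coefficients (frequency buckets); illustrative
max_len = 11   # number of time frames to keep; illustrative

def wav2mfcc(path, n_mfcc=buckets, max_len=max_len):
    # Load the WAV file as a mono signal.
    signal, sample_rate = librosa.load(path, mono=True)
    # Compute MFCCs: shape (n_mfcc, time_frames).
    mfcc = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=n_mfcc)
    # Zero-pad short clips and truncate long ones to a fixed width,
    # so every example has the same (n_mfcc, max_len) shape.
    if mfcc.shape[1] < max_len:
        mfcc = np.pad(mfcc, ((0, 0), (0, max_len - mfcc.shape[1])),
                      mode='constant')
    else:
        mfcc = mfcc[:, :max_len]
    return mfcc
```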
Now, you may not have seen a spectrogram before; you can find lots of apps that make them. In a spectrogram, the x-axis is typically time, the y-axis is the frequency of the sound, and the darkness is the amount of energy at that frequency. In music or in science you typically get spectrograms with even or logarithmic intervals between the frequencies, but when you're processing speech there's a slightly different transformation that people typically do, called MFCC, and that's the one I use here. You can roughly think of it as buckets of frequencies and buckets of time.
So we do that transformation, and then we load the training and test sets into the familiar X_train, X_test, y_train, and y_test values. This is just like previous videos: there, X_train was a set of images, and in this video it's going to be a set of audio spectrograms. X_test is the validation data for that. y_train holds the labels, where 0 corresponds to "bed", 1 corresponds to "happy", and 2 corresponds to "cat", and y_test is the same but corresponds to the test data.
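In the notebook that's a one-liner; get_train_test here is a hypothetical helper in the spirit of the preprocessing library:

```python
# Hypothetical helper from the preprocessing library: returns the MFCC
# arrays plus integer labels (0 = "bed", 1 = "happy", 2 = "cat").
X_train, X_test, y_train, y_test = get_train_test()
```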
Then we're going to reshape our data a little bit: we're going to add a channel element. This is because typically with audio you'll have a left channel and a right channel. In this case the channel has actually been removed, so there really is only one channel, but this might make the code a little more generalizable to typical audio files that you'll see out in the wild.
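The reshape itself is just adding a trailing axis, something like:

```python
# Add a trailing channel axis so each example is a one-channel "image"
# of shape (buckets, max_len, 1); stereo audio would have 2 channels,
# but these files have been mixed down to one.
channels = 1
X_train = X_train.reshape(X_train.shape[0], buckets, max_len, channels)
X_test = X_test.reshape(X_test.shape[0], buckets, max_len, channels)
```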
And then, before we do anything else, I think it's nice to take a look at the data we're dealing with using the imshow command. That works super well when we're dealing with images, right? You can actually look at the image and see, oh, that's the number 4, or oh, that's a picture of my friend's face. With audio spectrograms it's a little less clear what's going on, but it's kind of nice to look at anyway.
So we can look at the hundredth value of X_train, and we can see that it seems to start off a little quieter and maybe get a little bit louder; it's a little hard to interpret. We can also print out the corresponding y_train label and see what it was, and it looks like it's the zeroth label, which would be "bed". So this is some kind of distorted spectrogram of somebody saying "bed".
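That look at the data is just a couple of lines:

```python
import matplotlib.pyplot as plt

# Render the hundredth training example as an image and check its label.
plt.imshow(X_train[100, :, :, 0])
plt.show()
print(y_train[100])  # prints 0 here, which corresponds to "bed"
```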
One more thing before we get to the models: we have to transform y_train and y_test into one-hot versions. We talked about this a lot in previous videos and you can find the details there, but essentially we're going from a single number to a vector of numbers where the one corresponds to the label that we want.
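In Keras that's the usual to_categorical call:

```python
from keras.utils import to_categorical

# 0 -> [1, 0, 0], 1 -> [0, 1, 0], 2 -> [0, 0, 1]
y_train_hot = to_categorical(y_train)
y_test_hot = to_categorical(y_test)
```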
Then, as usual, we're going to start with the simplest possible model, and in this case that's a perceptron. As usual, we first call Flatten to remove all the structure of our data, so the buckets, the length, and the channel all get flattened out into a single vector, and then we call a Dense layer on that. That's a fully connected layer, in this case with three different outputs, one corresponding to each word we're trying to classify, and the typical softmax activation function we use when we're doing multi-class classification. We're going to use categorical cross-entropy as usual and the Adam optimizer, and we're also going to report on accuracy in this case.
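Put together, the perceptron looks like this:

```python
from keras.models import Sequential
from keras.layers import Dense, Flatten

# Flatten the (buckets, max_len, channels) spectrogram into one vector,
# then a single dense softmax layer: a multi-class perceptron.
model = Sequential()
model.add(Flatten(input_shape=(buckets, max_len, channels)))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
```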
All right, so let's fit that model. You can see that because the dataset is reasonably small, the model runs quite fast, and this very simple linear model gets us around 80% accuracy on the validation data, which is not bad.
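The fit call is the same as always; the epoch count here is illustrative:

```python
model.fit(X_train, y_train_hot,
          epochs=10,  # illustrative; the dataset is small, so this is quick
          validation_data=(X_test, y_test_hot))
```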
Okay, so now here's the really cool thing: because we have our data in such a standard format, we can actually pull from all the different types of models that we've built in earlier videos to make this model better.
So the first thing we can try, and this is something that people really do, is apply a convolutional network to this. Now, you might argue that maybe we should use a 1D convolution, more like text, and you can try that, because you can think of each frequency as a separate channel. But because the frequencies do have meaning (two frequencies close to each other are actually kind of semantically close), I think a 2D convolution is also a reasonable thing to try. So let's start with that. In my ml-class videos directory you can find examples of all these different classifiers.
So let's go into cnn.py and see what happens when we paste in a standard one-level convolutional neural network. We can copy this model code right into our notebook; we just have to change the input shape to be (buckets, max_len, channels). We can set this to be a 3x3 convolution and the dense layer size to 128, compile the model in the same way, and then fit the model in the exact same way.
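Here's a sketch of that one-convolution model; the 3x3 kernel and the 128-unit dense layer come from the walkthrough, while the filter count and the pooling layer are assumptions based on a standard one-level CNN:

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
# 2D convolution over the spectrogram "image"; 32 filters is an assumption.
model.add(Conv2D(32, (3, 3), activation='relu',
                 input_shape=(buckets, max_len, channels)))
model.add(MaxPooling2D((2, 2)))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train_hot, epochs=10,
          validation_data=(X_test, y_test_hot))
```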
And again, because it's such a small number of samples, it learns very fast. Let's take a look, and actually this model is very, very good: it gets over 90%, ninety-three, ninety-four percent accuracy on our test data right off the bat, which is really cool. We've taken the machinery that we learned in different domains and applied it to this totally different domain, and the same intuition we had, that convolutions might work better, actually turns out to be the case.
And you might think, well, if one convolution works well, what about two convolutions? So we can take the same thing that we did before, with a convolution and a pooling and then a second convolution and a pooling, build that model, and compile it. And in our project we'd call the first run "perceptron", this one "one convolution", and this one "two convolutions".
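A sketch of the two-convolution version; the filter counts are assumptions, and with the illustrative buckets and max_len values from earlier the layer shapes work out:

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
# First convolution + pooling block.
model.add(Conv2D(32, (3, 3), activation='relu',
                 input_shape=(buckets, max_len, channels)))
model.add(MaxPooling2D((2, 2)))
# Second convolution + pooling block.
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train_hot, epochs=10,
          validation_data=(X_test, y_test_hot))
```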
You can see here that our two-convolution model is actually slightly better than our one-convolution model, which is awesome: maybe 94% accuracy versus 93% accuracy. But another thing is pretty glaring, which is that this is the test accuracy, and on the training data both the one-convolution and the two-convolution models have 100% accuracy. So it seems like we have an issue with overfitting, and again we can apply all the intuitions that we learned on text and image data to this problem.
The clear first thing to try when you see this is to add some dropout. So let's put a little bit of dropout in our model, in the same places that we did before: we can call model.add(Dropout(0.25)) to drop out a quarter of the activations, then compile the model and run fit again.
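Here's the two-convolution model with dropout added; exactly where the Dropout layers go mirrors what we did in earlier videos and is an assumption here:

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu',
                 input_shape=(buckets, max_len, channels)))
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.25))  # drop a quarter of the activations
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train_hot, epochs=10,
          validation_data=(X_test, y_test_hot))
```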
You see that the two-convolution model with dropout is actually learning slower on the training data, but it kind of continues to improve, and the same thing happens on the test data: it starts off a little bit worse, but as it runs it gets better and better. So the dropout actually allows the model to fit the data even a little bit better than without it. All the things we expect, all the theory and intuitions we've learned so far, apply to audio just as well as to images or text, which I think is super cool. Let's let that run a little bit. Then there's one more thing you can try, which we did on text: we could take LSTMs or GRUs and apply them to audio. This might make sense especially if we had variable-length audio files or much longer audio files. I think CNNs probably make a little more sense for these tiny files, where they run well, but let's take a peek and see how LSTMs do. We can copy the code from our LSTM video.
When we copy the code over, we actually get an error, and it's a shape error: the LSTM expects a two-dimensional input per example, not a three-dimensional one, so you get this scary error message. In this case, though, remember we added the channel dimension later, so we could do a more complicated reshape, but I think the simplest thing is just to undo the reshaping that we did before, and then we can try the LSTM.
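A sketch of that; the LSTM size is an assumption, and note that after undoing the reshape the LSTM just steps over the first axis of each (buckets, max_len) example:

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense

# Undo the channel reshape: the LSTM wants 2D examples, (steps, features).
X_train = X_train.reshape(X_train.shape[0], buckets, max_len)
X_test = X_test.reshape(X_test.shape[0], buckets, max_len)

model = Sequential()
model.add(LSTM(32, input_shape=(buckets, max_len)))  # 32 units is an assumption
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train_hot, epochs=10,
          validation_data=(X_test, y_test_hot))
```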
Now, the LSTM's performance is significantly worse than the convolutions', but that might be because we used a small LSTM. It could also be that our data just isn't very long; I think LSTMs would matter more as the data gets much longer. We could spend some time doing real hyperparameter tuning and maybe get this LSTM to the same accuracy as the CNNs, but I'll just say that for these kinds of short audio files, CNNs are going to train faster, run faster, and probably be the better choice. If we were classifying really long conversations, that's where LSTMs might really shine. I guess my biggest point here, and we can go deeper on all types of audio processing in subsequent videos, is that the stuff you're learning is really transferable across domains. Domain expertise has a huge role to play, but this stuff with CNNs is surprisingly transferable across many different areas, and I think that's just super exciting. So we'll do some more videos on audio, and we'll do some more videos on more complicated architectures. Can't wait to do them!
