So, so far we've seen the power of just using a simple multilayer perceptron to solve a wide variety
of problems,
but you can take things up a notch, you can arrange more complicated neural networks together and do
more complicated problems with them,
so let's start by talking about convolutional neural networks, or CNNs for short.
Usually you hear about CNNs in the context of image analysis, and their whole point is to find things
in your data that might not be exactly where you expected them to be.
So technically we call this "feature location invariant,"
that means that if you're looking for some pattern or some feature in your data, but you don't know where
exactly it might be in your data,
a CNN can scan your data and find those patterns for you wherever they might be.
So, for example, in this picture here, that STOP sign could be anywhere in the image and a CNN is able to
find that STOP sign no matter where it might be.
Now, it's not just limited to image analysis,
it can also be used for any sort of problem where you don't know where the features you're looking for might be
located within your data, and machine translation or natural language processing tasks come to mind for that,
you don't necessarily know where the noun or the verb or phrase you care about might be in some paragraph
or sentence, say, you're analyzing, but a CNN can find it and pick it out for you.
Sentiment analysis is another application of CNNs,
so you might not know exactly where a phrase might be that indicates some happy sentiment or some
frustrated sentiment or whatever you might be looking for,
but a CNN can scan your data and pluck it out. And you'll see that the idea behind it isn't really as
complicated as it sounds.
This is another example of using fancy words to make things seem more complicated than they really are.
So how do they work? Well, CNNs, convolutional neural networks, are inspired by the biology of your visual
cortex,
they take cues from how your brain actually processes images from your retina and it's pretty cool,
and it's also another example of interesting emergent behavior.
So the way your eyes work is that individual groups of neurons service a specific part of your field
of vision,
so we call these "local receptive fields,"
they are just groups of neurons responding to a part of what your eyes see, it subsamples the
image coming in from your retinas, and just has specialized groups of neurons for processing specific
parts of the field of view that you see with your eyes.
Now, these little areas overlap each other to cover your entire visual field
and this is called "convolution." Convolution is just a fancy way of saying "I'm going to break up this
data into little chunks and process those chunks individually," and then they'll assemble a bigger picture
of what you're seeing higher up in the chain. So the way it works within your brain is that you have
many layers, it is a deep neural network, that identifies various complexities of features, if you will.
So the first layer that you go into in the convolutional neural network inside your head might just
identify horizontal lines, or lines at different angles, or, you know, specific kinds of edges,
we call these "filters," and that might feed into a layer above them that would then assemble those lines
that it identified at the lower level into shapes, and maybe there's a layer above that that would be
able to recognize objects based on the patterns of shapes that you see.
And then if you're dealing with color images we have to multiply everything by 3 because you actually
have specialized cells within your retina for detecting red, green and blue light,
and we need to assemble those together as well, those each get processed individually, too.
So that's all a CNN is, it is just taking a source image, or source data of any sort really, breaking it
up into little chunks called "convolutions," and then we assemble those and look for patterns at increasingly
higher complexities at higher levels in your neural network.
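To make the "little overlapping chunks" idea concrete, here's a minimal sketch in plain Python. It isn't how your brain or any real library implements it, and the tiny image and the hand-picked horizontal-edge filter are made up purely for illustration. In a real CNN the filter values are learned during training rather than written by hand.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the response map.

    Like most deep learning libraries, this does not flip the kernel.
    """
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # Each output value summarizes one small patch; neighboring patches overlap.
            total = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Hypothetical 5x5 grayscale image: dark (0) on top, bright (1) on the bottom.
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
]

# Hand-picked filter that responds to a horizontal edge (dark above, bright below).
horizontal_edge = [
    [-1, -1, -1],
    [0, 0, 0],
    [1, 1, 1],
]

response = convolve2d(image, horizontal_edge)
for row in response:
    print(row)
# [3, 3, 3]
# [3, 3, 3]
# [0, 0, 0]
```

The filter gives a strong response wherever a patch straddles the dark-to-bright edge and zero where the patch is uniform, and it would do the same wherever that edge sat in the image.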
So how does your brain know that you're looking at a STOP sign there?
Let's talk about this in more colloquial language, if you will.
So like we said, you have individual local receptive fields that are responsible for processing specific
parts of what you see, and those local receptive fields are scanning your image and they overlap with
each other looking for edges.
You might notice that your brain is very sensitive to contrast and edges that it sees in the world,
those tend to catch your attention, right?
That's why the letters on this slide catch your attention because there's high contrast between the
letters and the white background behind them.
So at a very low level you're picking up on the edges of that STOP sign and the edges of the letters on
the STOP sign.
Now, a higher level might take those edges and recognize the shape of that STOP sign and say: "Oh! There's
an octagon there, that means something special to me," or "those letters form the word stop, that means something
special to me, too,"
and ultimately that will get matched against whatever classification pattern your brain has of a STOP
sign,
so no matter which receptive field picked up that STOP sign, at some layer it will be recognized as
a STOP sign
and furthermore, because you're processing data in color, you can also use the information that the STOP
sign is red and further use that to aid in its classification of what this object really is.
So, somewhere in your head there's a neural network that says: "hey! if I see edges arranged in an octagon
pattern that has a lot of red in it and says stop in the middle, that means I should probably hit the
brakes on my car," and at some even higher level, where your brain is actually doing higher reasoning,
that's what happens,
there's a wire that says: "hey there's a STOP sign coming up here, I better hit the brakes in my car," and
if you've been driving long enough, you don't even really think about it anymore,
do you? Like it's almost hardwired
and that literally may be the case.
Anyway, an artificial convolutional neural network works the same way,
same exact idea.
So how do you build a CNN with Keras?
Obviously you probably don't want to do this at the low level TensorFlow layer, you can! But CNNs get
pretty complicated.
A higher level library like Keras becomes essential.
First of all, you need to make sure your source data is of the appropriate dimensions, of the appropriate
shape, if you will, and you are going to be preserving the actual 2D structure of an image if you're
dealing with image data here,
so the shape of your data might be the width times the length times the number of color channels, and
by color channels
I mean, if it's a black and white image there's only one color, so you'd only have one color
channel for a greyscale image,
but if it's a color image you'd have three color channels, one for red, one for green and one for blue,
because you can create any color by combining red, green, and blue together,
OK?
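As a rough sketch of what that shape looks like in practice, here's a NumPy example. The 28x28 image size and the batch of 100 images are assumptions chosen just for illustration, and the channels-last layout shown is the Keras default rather than the only option. Note that arrays are indexed rows first, then columns.

```python
import numpy as np

# Hypothetical batch of 100 flattened 28x28 grayscale images (placeholder values).
flat_images = np.zeros((100, 28 * 28), dtype="float32")

# Restore the 2D structure and add one color channel: (samples, rows, cols, channels).
gray_images = flat_images.reshape(-1, 28, 28, 1)
print(gray_images.shape)   # (100, 28, 28, 1)

# A color image would carry three channels instead: red, green, and blue.
color_images = np.zeros((100, 28, 28, 3), dtype="float32")
print(color_images.shape)  # (100, 28, 28, 3)
```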
Now, there are some specialized types of layers in Keras that you use when you're dealing with convolutional
neural networks,
for example, there's the Conv2D layer type that does the actual convolution on a 2D image,
and again, convolution is just breaking up that image into little subfields that overlap each other for
individual processing.
There's also a Conv1D and a Conv3D layer available as well,
you don't have to use CNNs with images, like we said, they can also be used with text data, for example, that
might be an example of one dimensional data, and there's also a Conv3D layer available if you're
dealing with 3D volumetric data of some sort,
so a lot of possibilities there. Another specialized layer in Keras for CNNs is MaxPooling2D,
obviously there's a 1D and 3D variant of that as well.
The idea of that is just to reduce the size of your data,
so it takes just the maximum value seen in a given block of an image and reduces the layer down to
those maximum values, it's just a way of shrinking the images in such a way that it can reduce the processing
load on the CNN. As you'll see, processing CNNs is very compute-intensive,
and the more you can do to reduce the work you have to do, the better.
So if you have more data in your image than you need, a MaxPooling2D layer can be useful for distilling
that down to the bare essence of what you need to analyze.
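Here's a small sketch of that max pooling idea in plain Python, assuming non-overlapping 2x2 blocks (a common choice, but just one possible setting). The helper name max_pool_2x2 and the feature map values are made up for illustration; this is not the Keras implementation.

```python
def max_pool_2x2(image):
    """Keep only the largest value in each non-overlapping 2x2 block."""
    pooled = []
    for r in range(0, len(image) - 1, 2):
        row = []
        for c in range(0, len(image[0]) - 1, 2):
            block = [
                image[r][c], image[r][c + 1],
                image[r + 1][c], image[r + 1][c + 1],
            ]
            row.append(max(block))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 feature map.
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 3],
]

pooled = max_pool_2x2(feature_map)
print(pooled)  # [[4, 2], [2, 7]]
```

The 4x4 input shrinks to 2x2, so the layers that follow have a quarter as many values to process.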
Finally, at some point you need to feed this data into a flat layer of neurons, right? At some point it's
just going to go into a perceptron,
and at this stage we need to flatten that 2D layer into a 1D layer, so we can just pass it into a layer
of neurons, and from that point it just looks like any other multilayer perceptron.
So the magic of CNNs really happens at a lower level,
you know, ultimately it gets converted into what looks like the same types of multilayer perceptrons
that we've been using before, the magic happens in actually processing your data,
convolving it and reducing it down to something that's manageable.
So typical usage of image processing with a CNN would look like this,
you might start with a Conv2D layer that does the actual convolution of your image data,
you might follow that up with a MaxPooling2D layer on top of that that distils that image down, just
shrinks the amount of data that you have to deal with,
you might then do a dropout layer on top of that, which just prevents overfitting like we talked about
before,
and at that point you might apply a Flatten layer to actually be able to feed that data into a perceptron,
and that's where a dense layer might come into play, so a dense layer in Keras is just a perceptron really,
you know, it's a hidden layer of neurons.
From there we might do another dropout pass to further prevent overfitting and finally do a softmax
to choose the final classification that comes out of your neural network. And like I said, CNNs are compute
intensive, they are very heavy on your CPU, your GPU and your memory requirements,
shuffling all that data around and convolving it adds up really, really fast
and beyond that, there's a lot of what we call "hyperparameters," a lot of different knobs and dials that
you can adjust on CNNs.
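The typical layer stack described a moment ago (Conv2D, MaxPooling2D, Dropout, Flatten, Dense, Dropout, softmax) might be sketched in Keras roughly like this. Treat it as an illustration rather than a recommended model: the 28x28 grayscale input shape, the 32 filters, the 3x3 kernel, the dropout rates, the 128 hidden units, the 10 output classes, and the choice of optimizer and loss are all assumptions picked for the example. They are also exactly the kind of knobs and dials you'd end up tuning.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    # Assumed input: 28x28 images with a single (grayscale) color channel.
    keras.Input(shape=(28, 28, 1)),
    # Convolve the image with 32 learned 3x3 filters.
    layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
    # Shrink each feature map by keeping the max of every 2x2 block.
    layers.MaxPooling2D(pool_size=(2, 2)),
    # Randomly drop units during training to reduce overfitting.
    layers.Dropout(0.25),
    # Flatten the 2D feature maps into a 1D vector for the perceptron part.
    layers.Flatten(),
    # A hidden layer of neurons.
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    # Softmax picks among the (assumed) 10 classes.
    layers.Dense(10, activation="softmax"),
])

model.compile(
    loss="categorical_crossentropy",
    optimizer="adam",
    metrics=["accuracy"],
)
model.summary()
```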
So in addition to the usual stuff you can tune like the topology of your neural network, or what optimizer
you use, or what loss function you use, or what activation function you use,
there's also choices to make about the kernel sizes, that is the area that you actually convolve across,
how many layers do you have, how many units do you have,
how much pooling do you do when you're reducing the image down,
there's a lot of variance here, there's almost an infinite number of possibilities here for configuring
a CNN, and often just obtaining the data to train your CNN with is the hardest part.
So, for example, if you own a Tesla, that's actually taking pictures of the world around you and the road
around you, and all the street signs and traffic lights as you drive,
and every night it sends all those images back to some data server somewhere,
so Tesla can actually run training on its own neural networks based on that data,
so if you slam on the brakes while you're driving a Tesla at night, that information is going to be fed
into a big data center somewhere and Tesla is going to crunch on that and say: "hey, is there a pattern
here to be learned from what I saw from the cameras on the car
that means you should slam on the brakes?"
in the case of a self-driving car. And you think about the scope of that problem, just the sheer magnitude
of processing and obtaining and analyzing all that data, that becomes very challenging in and of itself.
Now, fortunately the problem of tuning the parameters doesn't have to be as hard as I described it to
be.
There are specialized architectures of convolutional neural networks that do some of that work for you,
so a lot of research goes into trying to find the optimal topologies and parameters for a CNN for a
given type of problem and you can just think of this as a library you can draw from.
So, for example, there's the LeNet-5 architecture that you can use, that's suitable for handwriting
recognition in particular;
there's also one called AlexNet, which is appropriate for image classification,
it's a deeper neural network than LeNet, you know, so in the example we talked about on the previous
slides we only had a single hidden layer,
but you can have as many as you want really,
it's just a matter of how much computational power you have available. There's also something called GoogLeNet,
you can probably guess who came up with that,
it's even deeper,
but it has better performance because it introduces this concept called "inception modules,"
they basically group convolution layers together and that's a useful optimization for how it all works.
Finally, the most sophisticated one today is called ResNet,
that stands for Residual Network,
it's an even deeper neural network, but it maintains performance by what's called "skip connections,"
so it has special connections between the layers of the perceptron to further accelerate things, so it sort
of builds upon the fundamental architecture of a neural network to optimize its performance.
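The core of a skip connection can be sketched in a few lines of plain Python: the block's input gets added back onto whatever the block computes. This is only a toy illustration of the idea, not ResNet itself; the names residual_block and toy_layer, and the halving function standing in for real layers, are hypothetical.

```python
def residual_block(x, transform):
    """Return transform(x) + x: the skip connection adds the input back in."""
    fx = transform(x)
    return [f + xi for f, xi in zip(fx, x)]


def toy_layer(x):
    # Stand-in for one or more real layers; here it just halves each value.
    return [0.5 * v for v in x]


x = [1.0, 2.0, 3.0]
y = residual_block(x, toy_layer)
print(y)  # [1.5, 3.0, 4.5]
```

Because the input passes straight through, the layers inside the block only have to learn the difference (the "residual") from the input, which is what lets very deep stacks of these blocks remain trainable.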
And as you'll see, CNNs can be very demanding on performance.
So with that let's give it a shot!
Let's actually use a CNN and see if we can do a better job at image classification than we've done before
using one.
