All the examples we usually think about when we talk about machine learning, from hot dog / not hot dog to predicting the stock market to understanding speech, are really about classifying things you've already seen. When we look at the MNIST data set, we're talking about labeling the digits 0 through 9; we're not talking about recognizing other digits that we've maybe never seen before. In the real world, it's actually common to want to classify something you've literally never seen, and humans do this really well. When I see a spatula for the first time, I recognize it as a new object. Maybe I don't know the name for it, but I recognize that it's a thing, and if I see it again I can say: oh, that's that thing I saw before. So how do we make computers do this kind of classification? This is sometimes called one-shot or zero-shot learning, or few-shot if you have a couple of examples, and it has really been a challenge for machine learning in general to make this work.
One of the approaches that I think is really exciting, and that I want to talk about today, is a reframing of the problem. Instead of looking at one object, asking what that object is, and training on that, I want to train on pairs of objects, where the question is: are these two objects the same? What's so cool about that is that if I build a classifier of pairs of things rather than of one thing at a time, I can take something new, compare it to one example (maybe a canonical example) of a known thing, and ask: is it that thing? So my classifier can potentially generalize, not just to the classes it saw in the training data, but to anything it might see out there. And this technique is really generalizable: I'm going to do it on images, but the same approach can be used on video, on audio, on tons of different kinds of data. So let's get to it.

All right, let's walk through an example of how we're going to do this.
I'm going to do it on the MNIST data set at first, just because it's a data set you're probably familiar with from previous videos and it's really fast to run experiments on. First we have the requisite lots of imports, and then we load the data exactly as we've done in previous videos: x_train is the training images, y_train is the training labels, x_test is the test images, and y_test is the test labels. Then we normalize, just like in a lot of other videos, dividing the values by 255 so that our pixels are between 0 and 1 instead of between 0 and 255.
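Here's a minimal sketch of that setup (the variable names follow the walkthrough; the exact imports in the notebook may differ):

    import numpy as np
    from tensorflow import keras

    # Load MNIST: 60,000 training and 10,000 test images of the digits 0-9.
    (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

    # Scale pixel values from [0, 255] down to [0, 1].
    x_train = x_train / 255.0
    x_test = x_test / 255.0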
But now we're going to do something new: we're going to call this function that I wrote called make_pairs. What make_pairs does is take in input data and labels and build a new kind of data set, one made of pairs of images, where the label is 1 if the two images correspond to the same category of thing and 0 if they correspond to different categories. I just wrote a little bit of code that randomly walks through the digits, picks another digit that matches and adds a "same thing" label, then finds two that don't match and adds a "not the same thing" label. So what comes out of this is a data set where half the pairs are the same thing and half are different things. At the bottom, this function creates a new variable called pairs_train, which holds the pairs of images, and labels_train, which is 0 for "not the same thing" and 1 for "the same thing".
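The pairing code itself isn't reproduced here, but a minimal sketch in the spirit of what's described might look like this (the balanced 50/50 split of positive and negative pairs is the key property):

    def make_pairs(images, labels):
        # For each image, emit one matching pair (label 1) and one
        # non-matching pair (label 0), so the output is balanced.
        pairs, pair_labels = [], []
        # Precompute the indices of every example of each digit.
        class_indices = [np.where(labels == d)[0] for d in range(10)]
        for idx, label in enumerate(labels):
            # Positive pair: another random image of the same digit.
            match = np.random.choice(class_indices[label])
            pairs.append([images[idx], images[match]])
            pair_labels.append(1)
            # Negative pair: a random image of a different digit.
            other = np.random.randint(0, 10)
            while other == label:
                other = np.random.randint(0, 10)
            non_match = np.random.choice(class_indices[other])
            pairs.append([images[idx], images[non_match]])
            pair_labels.append(0)
        return np.array(pairs), np.array(pair_labels)

    pairs_train, labels_train = make_pairs(x_train, y_train)
    pairs_test, labels_test = make_pairs(x_test, y_test)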
So we can run this; we've got to load the data first, and then we can run this cell. As usual, I always recommend taking a little peek at the data, so why don't we look at pairs_train[4, 0]? It turns out that it's a number four; it's just an accident that the fourth example is a four. And if we look at pairs_train[4, 1], that's a different-looking four, so we would expect labels_train[4] to be a 1, meaning the two images show the same digit. Why don't we just print that out: labels_train[4], and yep, it's a 1, they're the same digit. We could also look at, say, the 400th example of pairs_train: that's a one, and the adjacent image is another one, in slightly different handwriting, but the same digit.
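If you want to peek the same way, a quick sketch (assuming matplotlib is available):

    import matplotlib.pyplot as plt

    # Show both halves of the 5th pair (index 4) alongside its label.
    fig, axes = plt.subplots(1, 2)
    axes[0].imshow(pairs_train[4, 0], cmap="gray")
    axes[1].imshow(pairs_train[4, 1], cmap="gray")
    print("label:", labels_train[4])  # 1 means the two images show the same digit
    plt.show()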
So we've transformed our data; now what are we going to do with it? Naively, one thing we could do is pass each image into a separate dense network, concatenate those outputs, and have a final dense layer predict same image or different image. That's what we're going to do here. Our first sequential model is just a Flatten and then a Dense layer, so this is just the perceptron you might be used to, but we use a ReLU activation function because it's an intermediate piece of the network. Then we have the exact same kind of layer again, but with a different set of weights, for the second image.

And now here's a new layer you might not have seen before, but it's super useful: Concatenate. What it does is take the outputs of two layers and put them together into a single set of activations. It has no parameters; it just combines the two. Then the final layer, which I'm calling dense_layer, takes the merged activations as input and outputs a single number, hopefully a 1 if the images are the same and a 0 if they're not, so we use a sigmoid activation function because this is binary classification. We use the Keras functional definition to define this, because it's actually not a sequential model: we have two inputs and we're combining them, so it's not the simple sequential model we might be used to, and we need the more flexible way of defining it. Then we compile the model with binary cross-entropy, because we're doing a single binary classification, our standard Adam optimizer, and accuracy as the output metric.
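A sketch of that two-branch model in the Keras functional API (the layer and variable names here are my own, not necessarily the ones in the notebook):

    from tensorflow.keras import layers

    input_1 = keras.Input(shape=(28, 28))
    input_2 = keras.Input(shape=(28, 28))

    # Two separate towers, each with its own weights.
    tower_1 = keras.Sequential([layers.Flatten(), layers.Dense(128, activation="relu")])
    tower_2 = keras.Sequential([layers.Flatten(), layers.Dense(128, activation="relu")])

    # Concatenate has no parameters; it just joins the two 128-vectors into 256.
    merged = layers.Concatenate()([tower_1(input_1), tower_2(input_2)])

    # One sigmoid unit: 1 = same digit, 0 = different digits.
    output = layers.Dense(1, activation="sigmoid")(merged)

    model = keras.Model(inputs=[input_1, input_2], outputs=output)
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    model.summary()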
Let's take a quick look at what this model looks like before we run it. You can see that we have about 100,000 parameters in the dense layer for image 1 and another 100,000 parameters in the dense layer for image 2. Each of those fully connected layers outputs 128 numbers; we combine those into 256 numbers, and then we have a single perceptron with 256 inputs and one output at the bottom of our network. So in total it's about 200,000 parameters.
Now we can call fit, and again we'll call it on pairs_train[:, 0], which is one set of input images, and pairs_train[:, 1], which is the other set, with labels_train as the target, again the binary number: 0 if the images don't match and 1 if they do. Let's set that to, say, 10 epochs and let our model train.
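That call looks something like this (variable names as above):

    model.fit(
        [pairs_train[:, 0], pairs_train[:, 1]],  # the two image inputs
        labels_train,                            # 1 = same digit, 0 = different
        epochs=10,
    )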
So, this architecture does work, barely. You can see that every epoch improves the accuracy, but only by about 0.5%, and it starts at about 50% accuracy. So it's better than random, which is better than a lot of the networks I've made in my life, and we're onto something, but it seems pretty clear that we're going to need to make this work better. What we've done so far doesn't work very well, and it's unclear how well it would ever work. It's also not what people typically do when they want to do one-shot learning. What they really do is share weights across the model. Sharing weights across layers is actually pretty common in more advanced architectures, but we haven't done it yet, so it's a good thing to know, and it's really effective in this case; it's one of the things we have to do to make this really work well.

The intuition is that the model we run on the first input image and the model we run on the second input image really should be the same model, because the images are drawn from the same overall set of images, so the transform we want to apply to one image really should be the same transform we apply to the other.

To share weights across the model, we have to make more use of Keras's functional model definition, and I think this is where it gets a little confusing: when we define a layer in the functional style, it sort of just sets up the specification for the layer; it doesn't actually get attached to any input until we call it as a function once it's specified. So we set up a model that says: take an input, apply this Flatten step, then this Dense step. But we haven't attached that model to any input yet, so we're going to attach it to two different inputs, input_1 and input_2. I call the output attached to input_1 dense_1 and the one attached to input_2 dense_2, so we have two outputs, but they come from the same model attached to two different inputs. Then we can take those, combine them with the same Concatenate layer we used before, and add the same final dense layer that outputs a single number through a sigmoid activation; that number should be a 1 if we think the two images correspond to the same digit and a 0 if they're two different digits.
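A sketch of that weight-sharing version; the key move is defining one tower and calling it on both inputs:

    # One shared tower, defined once.
    shared_tower = keras.Sequential([layers.Flatten(), layers.Dense(128, activation="relu")])

    input_1 = keras.Input(shape=(28, 28))
    input_2 = keras.Input(shape=(28, 28))

    # Calling the same model on two inputs reuses the same weights.
    dense_1 = shared_tower(input_1)
    dense_2 = shared_tower(input_2)

    merged = layers.Concatenate()([dense_1, dense_2])
    output = layers.Dense(1, activation="sigmoid")(merged)
    shared_model = keras.Model(inputs=[input_1, input_2], outputs=output)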
We compile the model in the same way we did before, and then we can take a quick look at it. We can see that this model has about half the number of parameters of the previous model, because we're sharing them: whereas before we had two layers, each with a hundred thousand parameters, now we have one set of a hundred thousand parameters. There are still two different layer calls happening, but each call shares the same parameters. We can run this model too, and spoiler alert: it works a little bit better than the last thing we did, but not a lot better,
because there's actually one more fancy optimization we need to add, and then we'll have the typical setup of what's called a Siamese network. That's actually an old concept, it was talked about in the 90s, but I feel like it's had renewed interest in various forms as people have gotten more and more excited about deep learning, and about the one-shot learning problem specifically.

So we do from keras import backend as K. This harkens back to the time when Keras typically had multiple backends; these days it's almost always TensorFlow, so I just think of it as a way to run any TensorFlow operation here. Now I define a function that takes in inputs, which are actually TensorFlow tensors, and inside it I can call K.<something> for any TensorFlow operation I can find; here I'm using sum, square, square root, and maximum. All this is really doing is taking the square root of the sum of the squared differences between the two inputs, what's called the Euclidean distance. It's really just measuring how different the outputs of my two networks are, because that's what we're going to do with it: we're going to feed in the outputs of each network and compare them using the Euclidean distance.
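A sketch of that distance function in the classic Keras-backend style (the maximum with a tiny epsilon guards against taking the square root of exactly zero):

    from tensorflow.keras import backend as K

    def euclidean_distance(vects):
        # Euclidean distance between two batches of vectors.
        x, y = vects
        sum_square = K.sum(K.square(x - y), axis=1, keepdims=True)
        return K.sqrt(K.maximum(sum_square, K.epsilon()))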
So we define this nice little Euclidean distance function, and then we add a new kind of layer called a Lambda layer, which wraps a function (as the name implies), and we pass in our Euclidean distance function. So now we're actually building our own custom TensorFlow operation as a layer. What this does is, instead of making the network figure out what it should do with the outputs of the two towers we've defined, it tells it directly that what we want is for those two outputs to be similar when the inputs match: the more similar the outputs of my two networks are, the more likely the model thinks the two inputs are the same, in this case the same digit.
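The wiring, as a sketch on top of the shared towers above. Note that a raw distance is 0 when the inputs look identical, so to keep training with binary cross-entropy where 1 means "same", one common trick is to squash the distance through a single sigmoid unit (a hypothetical head here, not necessarily the notebook's exact setup); the alternative is a contrastive loss on the distance directly:

    # Replace the Concatenate + big Dense head with a distance computation.
    distance = layers.Lambda(euclidean_distance)([dense_1, dense_2])
    # Hypothetical head: one sigmoid unit mapping distance to a same/different score.
    same_prob = layers.Dense(1, activation="sigmoid")(distance)
    siamese = keras.Model(inputs=[input_1, input_2], outputs=same_prob)
    siamese.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])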
So let's run this network. We can compile it and look at it just like we did before, and we see that it's very similar to the previous model, but we don't have that last big dense layer to figure things out. And when you run this network, we actually see a marked improvement: in the first epoch we're already seeing accuracy above 70%. So by taking out some of the complexity, and instead pushing that complexity into the code, we've actually made a much more effective Siamese network.
Now, the real reason to do this is not the MNIST data set; it seems unlikely that you'd want to generalize to some other digit we haven't seen before. But there actually are lots of cases where you'd want to do it, and one is handwriting, where you might see characters that you haven't seen before. A super cool data set for this, one that's really fun and a lot like MNIST, is the Omniglot data set. I've actually left in a little bit of code to load the Omniglot data set, which brings in lots and lots of different characters from lots and lots of different languages.
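The notebook's own loading code isn't reproduced here, but one readily available route, as a sketch, is TensorFlow Datasets, which hosts Omniglot (1,623 characters drawn from 50 alphabets):

    import tensorflow_datasets as tfds  # assumes the tensorflow-datasets package is installed

    # Each example is an image of a handwritten character with a class label.
    ds_train, ds_test = tfds.load("omniglot", split=["train", "test"], as_supervised=True)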
So I think a fun next step would be to run this exact same architecture on the Omniglot data set and see if you can recognize characters, and see if you can actually build a system that recognizes characters in one alphabet and generalizes to other alphabets, because that is really magical and powerful, and it really shows off why one-shot learning can be so effective, and especially why Siamese networks work really well for this application.
