Hello everyone. Welcome to the second lecture for CS230.
So as I, I said earlier, uh,
you can go on menti.com, uh,
from your smartphones or your computers,
and enter this code, 845709.
Uh, we will use this tool for interactive questions
during the lecture and we will also use it to, to track attendance.
Uh, I'll add it at the end of the lecture,
but, uh, if you have time do it now.
[NOISE] Let's start the lecture,
while you guys are doing that.
Okay. So today's lecture is going to be about deep learning intuition,
and the goal is to give you a systematic way to think about projects,
everything related to deep learning.
It includes how to collect your data,
how to label your data,
how to choose an architecture,
but also how to design a proper loss function to optimize.
So all of these decisions are decisions you're going to have to do, during your projects.
And we'll try to give you here an overview of,
uh, this systematic way of thinking for different projects.
It's going to be high level,
more than other lectures,
but we hope it gives you a good start for your project.
We will start with the ten minute recap on, uh,
what you've seen in the two first,
in the first week, uh, about neural networks.
So as you know you can think of, uh, machine learning,
deep learning in general, as modeling
a function that takes an input that can be an image,
a speech, a natural language,
or a CSV file,
give it to a box and get an output that can be classification.
Is it a cat, zero,
is, is there a cat on this image,
output one, or is there no cat on this image, output zero?
And I think a good way to remember what is a model is to
define it as architecture plus parameters.
Architecture is, uh, the design that you choose.
So logistic regression is the first one you've seen.
You will see shallow neural networks, deep neural networks,
then you will see convolutional neural networks,
and recurrent neural networks.
So these are all types of
architectures and you can choose to make them deeper or shallower.
Parameters are the core parts.
They're the numbers that make your function
take this cat as inputs and convert it to an output.
So these are millions of numbers,
and the goal of machine learning deep learning,
is to find all these numbers.
So we're all, uh,
trying hard to find numbers basically,
millions of numbers in matrices.
If you give this cat and you forward propagate it,
so we propagate it through the model to get an output.
You will have to compare this output to the ground truth.
Uh, the function used to do so is called the loss function.
You've seen an example of a loss function this week.
That is the logistic loss function.
Uh, we will see more loss functions, uh, later on.
Uh, Computing the gradient of this loss function,
is going to tell you how much should I move my parameters in order to update,
uh, in, in order to make the loss go down.
So in order to make this function recognize cats better than before.
We do that many, many times,
until you find the right parameters to plug in your architecture,
you can then give your cats and get an output.
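The loop just described (forward propagate, compare to the ground truth with the loss, take the gradient, update the parameters, repeat) can be sketched in a few lines. This is only a minimal illustration under assumptions that are not from the lecture: a tiny made-up dataset of two-number inputs instead of images, plain batch gradient descent, and the hypothetical names train, lr and steps.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_loss(y_hat, y):
    eps = 1e-12  # avoids log(0)
    return -(y * math.log(y_hat + eps) + (1 - y) * math.log(1 - y_hat + eps))

def train(data, lr=0.5, steps=200):
    """data is a list of (x, y) pairs: x a list of floats, y either 0 or 1."""
    n = len(data[0][0])
    w, b = [0.0] * n, 0.0          # the parameters we are trying to find
    history = []                   # average loss at each step
    for _ in range(steps):
        grad_w, grad_b, total = [0.0] * n, 0.0, 0.0
        for x, y in data:
            # Forward propagation: input -> prediction.
            y_hat = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Compare the prediction to the ground truth.
            total += logistic_loss(y_hat, y)
            # Gradient of the logistic loss with respect to w and b.
            err = y_hat - y
            for i in range(n):
                grad_w[i] += err * x[i]
            grad_b += err
        m = len(data)
        # Move the parameters in the direction that makes the loss go down.
        w = [wi - lr * g / m for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / m
        history.append(total / m)
    return w, b, history

# Made-up toy data: label 1 when both numbers are large.
toy = [([0.0, 0.1], 0), ([0.2, 0.0], 0), ([0.9, 1.0], 1), ([1.0, 0.8], 1)]
w, b, history = train(toy)
```

The history list lets you check that the loss actually goes down from one step to the next, which is the first thing to look at when a training loop misbehaves.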
What is very interesting in deep learning is that many things can change.
You can change the input.
We talked about natural language speech,
structured and unstructured data in general.
You can change the output, uh,
It can be a classification algorithm,
it can be a multi-class algorithm.
I can ask you, give me the breed of the cats,
instead of asking you give me just the cat,
which makes the problem more complicated.
It can also be a regression problem.
I, I give you the cat and I ask you give me the age of the cat,
which is much more complicated again. Does that make sense?
Okay. Another thing that can change is the architecture,
we talked about it earlier.
And finally, the loss function.
I think the loss function is something that,
that people struggle with to understand what loss function to,
to choose, uh, for a specific project
and we're going to put a huge emphasis on that today.
Okay. And, of course,
in the architecture you can change the activation functions,
in this optimization loop you can choose a specific optimizer.
We're going to see in about three weeks,
all the optimizers that can be Adam,
stochastic gradient descent, batch gradient descent, RMSprop and momentum.
And finally, all the hyper parameters.
What is the learning rate of this loop?
What is the batch size that I'm using for my optimization?
We are going to see all that together,
but there's a bunch of things that can change in this scheme.
Any questions on that, in general?
So far so good.
Okay. So let's take the first architecture that we've seen together, Logistic Regression.
As you know, an image in computer science can be represented by a 3D matrix.
Each matrix represents a certain color.
RGB, red, green, blue.
We can take all these numbers from this 3D matrix and put them in a vector.
We flatten it in order to give it to our logistic regression.
We forward propagate it.
We multiply it by w, which is,
our parameter and b, which is our bias.
Give it to sigmoid function, get an output.
If the network is trained properly,
we should get a number that is more than 0.5
here to tell us that there is a cat in this image.
So this is the basic scheme.
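As a rough sketch of this basic scheme: flatten the image, multiply by w, add b, apply the sigmoid, and compare to 0.5. The 2 by 2 image, its pixel values and the height by width by channel layout are assumptions made up for illustration, and the weights are left at zero, so this is an untrained model.

```python
import math

def flatten(image):
    """image[row][col] is an [R, G, B] triple; returns one flat list of numbers."""
    return [value for row in image for pixel in row for value in pixel]

def predict(x, w, b):
    """Logistic regression forward pass: sigmoid(w . x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# A made-up 2 x 2 RGB image with values scaled to [0, 1].
image = [[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
         [[0.7, 0.8, 0.9], [1.0, 0.0, 0.5]]]

x = flatten(image)        # 2 * 2 * 3 = 12 numbers
w = [0.0] * len(x)        # untrained weights
b = 0.0                   # untrained bias
y_hat = predict(x, w, b)  # exactly 0.5 when every parameter is zero
is_cat = y_hat > 0.5      # the decision rule from the lecture
```

With trained values of w and b, the same two function calls are all that is needed at prediction time.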
Now, uh, my question for you is,
if I want to do the same thing but,
uh, I want to have a classifier that can classify several animals.
So on the image there could be a giraffe,
there could be an elephant or there could be a cat.
How would you modify this architecture?
Yes?
[NOISE] [inaudible]
Yes, exactly. So that's a good point.
We could add several units.
So several neurons, one for each animal and we will call it, multi-logistic regression.
So it could be something like that.
So we have a fully connected layer here,
before we were all,
all the inputs were connected to this neuron,
and now we added two neurons.
And each neuron is going to be responsible for one animal.
How do we know which neuron is responsible for which animal?
Is the network going to figure it out on its own,
or do we have to help it?
[NOISE] [inaudible]
Exactly, the label is important.
So what is going to tell your model this neuron should focus on cat,
this neuron should focus on elephant,
this neuron should focus on giraffe?
Is the way you label your data.
So how should we label this data, now,
if we were to do this specific task?
Any ideas? Yeah.
Uh, [NOISE] One-hot vector.
One-hot vector. Okay. So one-hot vector means,
a vector with all zeros and one,
one. Any other ideas?
[NOISE] One, two, three.
[NOISE] One, two, three.
So I assume you,
you say that each integer would correspond to a certain animal [NOISE]?
Okay. Any other ideas?
Modifying the loss function.
Modifying the loss function.
You mean, you want to put more weight on one animal,
so you modify the loss function?
Or what exactly- [NOISE]
It was more like towards the one-hot encoding, but [inaudible]
I see, with the one-hot encoding. So I agree with the one-hot encoding.
I think there's a downside to the one-hot encoding.
What is the downside of the one-hot encoding?
[NOISE] [inaudible]
Yes. So you're saying that the data without- if we have a lot of animals,
the data- the labels only contain zero and one,
one, so there's a huge imbalance there [NOISE].
I don't think that's an issue because these neurons are
independent from each other right now.
So yeah it, it could run into an issue if you have,
uh, you have really a lot of animals, that's true.
But there is another problem with it.
The problem is that,
do you think if you one,
if you one-hot encode your labels,
you would be able to detect an image with a giraffe and an elephant on the image?
You will not be able to do so.
You need a multi-hot encoding.
So in this case, if there is a cat on image I will use a one-hot.
I would say zero,
one, zero as my label.
But if I have a dog and a cat on the image I would say, one, one, zero.
Okay. The one-hot encoding works very well when you
have the constraint of having only one animal per image.
And in this case, you would not use an activation function called Sigmoid,
you would use another one, which is?
[NOISE] Softmax.
Softmax, yeah. The Softmax function,
we're going to see it together.
And for those of you who took 229,
you've probably heard of it.
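To make the labeling choice concrete, here is a small sketch of one-hot versus multi-hot labels, and of softmax versus independent sigmoids on the output neurons. The class order (cat, elephant, giraffe) and the raw scores are assumptions chosen for illustration; the lecture's spoken examples use a different ordering.

```python
import math

CLASSES = ["cat", "elephant", "giraffe"]  # assumed order of the output neurons

def encode(animals):
    """One entry per class: 1 if that animal is on the image, else 0."""
    present = set(animals)
    return [1 if name in present else 0 for name in CLASSES]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Turns raw scores into probabilities that sum to one."""
    top = max(scores)                      # subtracted for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

one_hot = encode(["cat"])                    # [1, 0, 0]: exactly one animal
multi_hot = encode(["giraffe", "elephant"])  # [0, 1, 1]: several animals at once

scores = [2.0, 1.0, 0.1]                     # made-up raw outputs of three neurons
probs_softmax = softmax(scores)              # classes compete, sums to one
probs_sigmoid = [sigmoid(s) for s in scores] # each neuron decides independently
```

Softmax fits the one-animal-per-image constraint because raising one probability lowers the others. Independent sigmoids fit multi-hot labels because several outputs can be close to one at the same time.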
Okay. So what I wanted to explain here is,
the way you choose your labeling is very important and it's a decision you should make,
prior to starting the project.
Okay. In terms of notation,
uh, In, In this class we're going to use the following.
A, square bracket one will denote all the activations of the first layer.
So the square brackets would,
would denote the layer and the subscript will denote the,
the index of the neuron in the layer.
Okay? And of course you can stack these neurons on top of each other to make the,
the network more complex,
depending on the task you're solving.
Okay. Now, the concept I wanted to introduce in this recap was the concept of encoding.
Uh, you probably- some of you have probably seen this image before.
If you have, uh, a network that is not too shallow,
you will notice that what the first neurons
see are very precise representations of the data.
So there are pixel level representations of the data.
X3i is probably, one of the three channels of the 3D matrix, just one number.
So what this neuron sees,
is going to be a pixel level representation of the image.
Okay? What this neuron sees, the second layer,
the one in the hidden layer,
is going to see the representation outputted by all the neurons in the first layer.
These are going to be more high level, more complex.
Because the first neurons will see pixels,
they are going to output a little more detailed information,
like, I found an edge here,
I found an edge there, and so on.
Give it to the second layer.
The second layer is going to see
more complex information and is going to give it to the third layer,
which is going to assemble some high level complex features that could be eyes,
nose, mouth, depending on what network you've been training.
So this is an extraction of what's
happening in each layer when the network was trained on,
uh, face recognition. Yes.
Um, doesn't this only apply to [inaudible] [NOISE] networks,
because the combination [NOISE] [inaudible] does not necessarily, uh, [inaudible] .
Yeah, yeah, yeah. So I think if I,
here I give you a fully-connected network, but that's true.
These type of visuals,
ah, are more, ah,
observed in convolutional neural networks because these are filters,
but this happens also in this type of network,
it's just harder to visualize.
Okay. So, this is what we call an encoding.
It means if I extract the information from this layer,
so all the numbers that are coming out of these edges,
I extract them, I will have a complex representation of my input data.
If I extract the numbers that are at the end of the first layer,
I will have a lower level representation of my data.
That might be edges, okay?
We're going to use this encoding,
ah, throughout this lecture.
Any questions on that?
Okay. So let's build intuition on concrete applications.
We're going to start, ah, with a short warm-up with the Day'n'Night classification,
and then quickly move to Face verification and Face recognition.
And after that, we'll do some Art generation and finish with a Trigger-word detection.
If we have time, we'll talk about how to ship a model,
which is shipping architecture plus parameters, okay,
with an emphasis, as I said,
on the architecture, the loss,
the training strategy, to help you make decisions during your project.
[NOISE] So, let's start with the first game.
[NOISE] Ah, we're given an image and we have to
build a network that tells us if the image is taken during the day,
label zero, or was taken at night, label one.
[NOISE] So, first question is,
what dataset do we need to collect?
Images captured.
Um?
Images that are captured during the day and during the night, and labeled.
Okay. Labeled images captured during the day and during the night.
I agree, though probably,
oh, yeah, let me ask the question. How many images?
[LAUGHTER] That was wrong, actually.
[LAUGHTER] How many images,
like how do you get this number?
[NOISE] Can someone give
me an estimate of how many images you need in order to solve this problem,
and explain how you get this estimate.
A number that's similar to a number of parameters.
You're saying a number similar to a number of parameters that you've in the network?
Yeah.
So I think it's better to think of it in the other way around.
The network comes after,
so you, right now, you don't know what networks you will use.
So you cannot decide the number of data points based on your parameters.
Later on, based on how your network is flexible,
you can add more data,
and a- ah, that's probably what you meant.
But, at first, you want to get,
you want to get the number. Yeah.
More o- more images than pixels within an image?
More images than pixels within an image.
Ah, I- I don't think that, that, that's,
that that has anything to do with the pixel within an image.
You can have a very simple task, like,
you have only images that are red and green,
and you want to classify red and green.
[NOISE] The image can be giant,
you can have a lot of pixels, it's not gonna change the number of data points you need.
Maybe images that have computation resources [inaudible]?
Okay. So, you're talking about computation resources,
so m- the more images we have,
probably the more computation resources we will need, is that what you mean?
Yeah, there is something like that.
I think, in general, ah,
you want to try to gauge the complexity of the task.
So, let's say, we did a problem that was cat recognition.
Detect if there is a cat on an image or not.
In this problem, we remember that with 10,000 images,
we managed to train a pretty good classifier.
How do you compare this problem to the cat problem?
You think it's easier or harder?
I think it's easier.
Easier. Yeah, I agree. That's probably easier.
[NOISE] So in terms of complexity,
this task looks less complex than the cat recognition task,
so you would probably need less data.
That's a rule of thumb. The second rule of thumb and why I get to this image is,
what do we exactly want to do?
Do we want to classify pictures that were taken outside,
which seems even easier?
Or do we want also the network to classify complicated pictures?
What, what do I mean by complicated pictures?
Inside your house. Um?
Inside your house.
So like, let's say, on a picture you have a window on the right side.
A human would be able to say it's the day because I see the window,
but for the network, it's going to take much longer to learn that,
much longer than for pictures taken outside.
What else? What are other complicated, okay, in the back.
Uh, like dawn or twilights or edges, um-
Dawn, twilight, sunrise, sunset, in general?
It's complicated because you have to define it and you have to teach your network what,
what does that mean, is it night or day.
Okay. So, depending on what task you want to solve,
it's going to tell you if you need more data or less data.
I think, for this task,
if you take outside pictures,
10,000 images is going to be enough,
but if you want the network to detect indoor as well,
you probably need 100,000 images or something.
And this is based on comparing with projects you did in the past,
so it's gonna come with experience.
Now, as you know, when you have a dataset,
you need to split it between train,
validation, and test sets.
Some of you have heard that.
We are going to see it together even more.
You need to train your network on a specific set and test it on another one.
How do you think you should split these 10,000 images?
Um? 50-50 between train and test?
80-20.
80-20? I think we, we,
we go more towards 80-20 because the test set is made
to analyze if your network is doing well on real-world data or not.
I think 2,000 images is enough to get that sense, probably,
and you want to put complicated examples in this dataset as well,
so I would go towards 80-20.
And the bigger the dataset,
the more I would put in the train set,
so if I have one million images,
I would put even more like, 98 percent, maybe,
in the train set, and two percent to test my model, okay?
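The two splits mentioned here, 80-20 for 10,000 images and 98-2 for one million, can be sketched as follows. The function name split, the fixed seed and the use of integers as stand-ins for images are assumptions for illustration only.

```python
import random

def split(items, train_fraction, seed=0):
    """Shuffles the items, then cuts them into a train part and a test part."""
    items = list(items)
    random.Random(seed).shuffle(items)   # fixed seed so the split is repeatable
    cut = int(round(len(items) * train_fraction))
    return items[:cut], items[cut:]

# Integers stand in for images here.
train_set, test_set = split(range(10_000), 0.80)               # 8,000 / 2,000
big_train_set, big_test_set = split(range(1_000_000), 0.98)    # 980,000 / 20,000
```

Note that 2 percent of one million is still 20,000 test images, ten times more than the 2,000 kept by the 80-20 split of the small dataset, which is why the percentage can shrink as the dataset grows.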
Now, I wrote bias here.
What do I mean by bias?
You just have a correct, like, balance between classes.
Yes. You need a correct balance between classes.
You don't want to give 9,000 dark images and 1,000 day images.
You want to balance between these two to teach your network to recognize both classes.
Okay. What should be the input of your network?
Um? The pixel image.
Yeah. So, this is an example of a pixel image.
It's the Louvre Museum during the day.
[NOISE] Harder question.
What should be the resolution of this image, and why do we care?
The more resolution [inaudible] [NOISE]
Okay. That's great. So, you said,
let me repeat for SCPD students as well,
as low as you can,
in order to achieve good results.
Why do we want low resolution?
It's because in terms of computation, it's going to be better.
Remember, if I have a 32 by 32 image,
how many pixels there are?
If it's color, I have 32 times 32 times three.
If I have 400 by 400,
I have 400 by 400 by three. It's a lot more.
So I want to minimize the resolution in order to
still be able to achieve good performance.
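The arithmetic behind this trade-off is worth writing out once. The helper name input_size is hypothetical; the two resolutions are the ones quoted above.

```python
def input_size(height, width, channels=3):
    """Number of input values the network receives for one image."""
    return height * width * channels

small = input_size(32, 32)     # 32 * 32 * 3 = 3,072
large = input_size(400, 400)   # 400 * 400 * 3 = 480,000
ratio = large / small          # 156.25 times more numbers per image
```

Every extra input value adds weights to the first layer, so the cost of a higher resolution shows up in both memory and computation.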
So what does it mean to still achieve good performance?
How do I get this number?
I'd continue with a similar resolution as opposed to the, uh, partial [inaudible].
Okay. Similar resolution as you expect the algorithm in real life to work on?
Yeah. Probably, I agree. What else?
What other rule of thumb can you use in order to choose this resolution?
Perhaps, um, we compare it to the performance of the [inaudible] we can tell if it's there [inaudible].
Yeah.
Great idea. Compare to human performance.
So what I do, so there is one way to do it,
which is the brute force way, I would say.
We will train models on different resolutions and then compare the results,
or you can be smart and use human performance as a comparison.
So I would print this image or
several images like these in different resolutions on paper.
And I would go see humans and say classify those,
classify those, and classify those.
And I would compare human performance on all these three types of resolution,
in order to decide what's the minimum resolution that I can use,
in order to get perfect human performance.
So by doing that, I got that 64 by 64 by three was enough resolution,
for a human, to detect if an image is taken during the day or during the night.
And this is a pretty small resolution in imaging,
but it seems like a small, like an easy task.
If you have to find a d- d- a breed of a cat,
you probably need more because some cats are very,
look very alike, and you need a high resolution to distinguish them,
and maybe training for the human as well.
I know only three breeds of cats so I wouldn't be able to do it anyway.
What should be the output of the model?
Labels about the image.
Labels, so Y equals zero for day,
Y equal one for night. I agree.
What should be the last activation of the network?
[NOISE] The last function?
Sigmoid.
Sigmoid. We saw that Sigmoid takes a number
between plus infinity- minus infinity and plus infinity,
puts it between zero and one so that we can interpret it as a probability.
What architecture would you use?
Fully-connected or convolutional.
Fully-connected or convolutional. I think,
later this quarter, you will see that convolutionals perform well in imaging,
so we would directly use a convolutional,
but I think a shallow network,
fully-connected or convolutional, would do the job pretty well.
You don't need a deep network because you gauge the complexity of this task.
[NOISE] And what should be the loss function, finally?
[NOISE]
It could be, um,
maximum number of functions like, uh, log-likelihood.
Yeah. So, the log-likelihood.
So, it's also called the logistic loss,
that's the one you're talking about [NOISE].
So, the way you get this number and you'll prove it in CS 229.
We're not going to prove it here.
But basically, you interpret your data in a probabilistic way and you
take the maximum likelihood estimation of the data which gives you this formula,
for those of you who did the math behind.
You can ask in office hours,
TA is going to help you understand it more properly.
Okay. And of course,
this means that if y equals zero,
we want y hat the prediction to be close to zero.
If y equal one we want y hat the prediction to be close to one.
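For reference, the standard form of the logistic loss for a single example is written out below. It is assumed here to be the formula the lecture refers to, since the slide itself is not part of the transcript.

```latex
\mathcal{L}(\hat{y}, y) = -\left[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\right]
```

When y = 1 only the first term remains, so the loss is small only if \hat{y} is close to one. When y = 0 only the second term remains, so the loss is small only if \hat{y} is close to zero.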
Okay. So, this was the warm up.
Now we're going to delve into Face verification.
Any questions on day and night classification? Yes.
You said that you increase the data without the percentage that
changes so you have
a kind of [inaudible].
So, your- the question is about how you choose
the size of the test set versus the train set.
In general, you would first say how many images do I need or data
points in order to be able to understand what my model does in the real world.
This can depend on the task.
Like if I talk about- if I- if I tell you about speech recognition,
you want to figure out if your model is doing well for all accents in the world.
So, your test set might be very big and very distributed.
In this case, you might have a few examples that are during the day,
few during the night and a few at dawn,
on sunset, sunrise and also indoor.
Three of those is going to give you a number.
So, there's no good number.
There is like you have to gauge it.
Okay one more question.
How do you choose that loss function [inaudible]?
Yeah, that's a good question. So, how do you choose the loss function?
We're going to see in the next, uh,
in the next slides how to choose loss functions but for this one specifically,
you choose this one because it- it- it's a, it's
a convex function for classification problem.
It's easier to optimize than other loss functions.
So, there is a proof but- but I will not go over it here.
If you know L1 loss,
that compares Y to Y hat this one is harder to optimize for a classification problem,
we would use it for regression problems.
Okay. [NOISE] So, our new game is the school wants to use
face verification to validate student IDs in facilities like the gym.
So, you know, when you enter the gym,
you swipe your ID and then, uh,
I guess the person sees your face on the screen based on
this ID and looks at your face in real life and compares, let's say.
So, now we want to put a camera and have you swipe and
the camera is going to compare this image to
the image in the database. Does that make sense?
To let you in or not. So, what's-
what dataset do we need to solve this problem? What should we collect?
Yeah. Okay. Between the ID and the image.
Yeah, so probably schools have databases because when you enter
the school you submit your image and you also are given a card, an ID.
So, you have this mapping.
Okay. What else do we need?
So, pictures of every student labeled with their names, that's what you say.
So, this is a picture of Bertrand.
This is a picture when he was younger.
And that's the one he gave to the school when he arrived.
What should be the input of our model? Is it this picture?
More photos of him.
More photos of him.
More photos of him. I'm asking just like the input of the model.
Like we probably need more photos of him as well but
what's- what's going to be the image we give to the model?
Exactly the person standing for verification.
Exactly, the person standing in front of the camera when entering the gym.
So, this is the entrance of the gym
and Bertrand is trying to enter the gym. So, it's him.
Okay. What should be the resolution?
Those of you who have done projects in imaging,
what do you think should be the resolution?
256 by 256.
256 by 256, any other idea more precisely.
I think in general [NOISE] you will go over 400,
so 400 by 400.
What's the reason? Why do we need
64 for- for day and night and 400 for face verification?
The video takes different shapes.
Yeah. There's more details to detect.
So, like distance between the eyes probably,
size of the nose, mouth,
uh, general- general features of the face.
These are harder to detect for a 64 by 64 image.
And you can test it, you can go outside and show
two pictures of people that look like each
other and ask people can you differentiate those two people or not.
And you'll see that with less than that sometimes it's- people are struggling.
Is color important?
Is color important. That's a good question.
We should have talked about it in day and night actually.
Is color important. Because if you remove the color,
you basically divide by three the number of pixels, right?
So, if we could do it without color,
we would do it without color.
In this case, color is going to be important because, uh,
probably you want your camera to work in, uh,
different settings, day and night as well.
So, the luminosity is different,
the brightness and also we all have
different colors and we need to all be detected, compared to each other.
Yeah. I might go somewhere in an island and come back, uh,
you know, full of color but,
uh, but I still want to be able to access the gym.
Uh, output. What should be the output?
The question on the resolution, is that a minimum resolution or is that like a-
I think if you have mo- in unlimited computational power,
you would take more resolution but that's a trade-off between computation and resolution.
So, output is going to be one,
if it's you and zero if it's not you in which case they will not let you in.
Okay. Now, uh, the question is what architecture should we use to solve
this problem now that we collected
the data set of mapping between student IDs and images.
The question is how do you know how many images you need to train the network-
The question is- [OVERLAPPING] [inaudible].
How do you know how many,
many images you need to train the network.
You don't know, you can find an estimate.
It's going to depend on your architecture.
But in general, uh,
the more complex a task, the more data you will need.
And we will see something called error analysis in
about four weeks which is once your network works,
you're going to give it a lot of examples.
Detect which examples are misclassified by
your network and you're going to add more of these in the training set.
So, you're going to boost your datasets.
Okay. Talking about the architecture.
If I ask you, what's the easiest way to compare two images, what would you do?
Like these two images,
the database image and the input image.
Some sort of hash.
Some sort of hash, what do you mean by that.
Taking the input run, uh,
set a specific function on it and then there.
Okay. Take an- take this,
run it into a specific function,
take this run it into a specific function and compare the two values.
That's great. That's a good idea.
And the more basic one is just compute the distance, uh, between the pixels.
Just compute the distance between the pixels and you get if it's the same person or not.
Unfortunately, it doesn't work and
a few reasons are the background lighting can be different.
And so if I do this minus this,
this pixel which is let's say dark is going to have a value of zero,
this pixel which is white is going to have a value of 255,
the distance is gigantic but it's still the same person.
Is a problem. Person can wear makeup,
can grow a beard, can be younger on a picture,
the ID can be outdated.
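A tiny numeric sketch shows why the raw pixel distance fails. All pixel values below are made up: four numbers stand for the face and four for the background, with 0 for a dark background and 255 for a white one as in the example above.

```python
import math

def pixel_distance(img_a, img_b):
    """Euclidean distance between two images given as flat lists of pixels."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(img_a, img_b)))

face = [100, 120, 110, 130]                            # made-up face pixels
id_photo = face + [0, 0, 0, 0]                         # dark background
gym_photo = face + [255, 255, 255, 255]                # same face, white background
other_person = [90, 140, 100, 150] + [0, 0, 0, 0]      # different face, dark background

same_person = pixel_distance(id_photo, gym_photo)          # 510.0
different_person = pixel_distance(id_photo, other_person)  # about 31.6
```

The same person ends up far away and a different person ends up close, purely because of the background, which is the opposite of what a verification system needs.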
So, it doesn't work to just compare these two pictures together,
we need to find a function that we will apply these- these- these two images
to and will give us a more- a better representation of the image.
So, that's what we're going to do now.
What we're going to do is that we will encode information,
use the encoding that we talked about of the picture in the vector.
So, we want a vector that would represent features like distance between eyes,
nose, mouth, color, all these type of stuff,
hair, uh, in a vector.
So, this is the picture of Bertrand from the ID, on the left.
We will run it through a network and hopefully we can find a good encoding of this picture.
Then we will run the picture of Bertrand at
the facility run it in the deep network, get another vector.
And hopefully if we train the network properly,
these two vectors should be close to each other.
Lets say we have a threshold that is 0.5,
0.4 is the distance between these two.
Is less than the threshold.
So, I would say Bertrand is the right person.
It's you. Does this scheme make sense?
What does the 128-d represent?
What does the 128-d vector represent?
The real question is can I say that the third entry corresponds to something specific?
It's complicated to say but depending on
what network you choose and the training process you choose,
it will give you a different network, a different vector.
So, that's what we're going to talk about now.
The question is how do I know that this vector is good?
Like right now, if I take a random network,
I give my image to it,
is going to output a random vector.
This vector is not going to contain any useful information.
I want to make sure that this information is useful and that's
how I will design my loss function.
Okay. So, just to recap,
we gather all students faces encoding in
a database once we have this and given a new picture,
we compute the distance between- between
the new picture and all the vectors in the database if we find a match.
Oh sorry. We compare this vector of
the input image with the vector corresponding to the ID image.
If it's small, we consider that is the same person.
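The verification step in this recap can be sketched as below. The encodings are made-up three-number vectors standing in for the real network outputs, the student key is hypothetical, and the 0.5 threshold is the one used in the example above. In practice the threshold would be tuned on held-out data.

```python
import math

def l2_distance(u, v):
    """Euclidean distance between two encoding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def verify(id_encoding, camera_encoding, threshold=0.5):
    """True when the two encodings are close enough to be the same person."""
    return l2_distance(id_encoding, camera_encoding) < threshold

# Hypothetical database: one stored encoding per student ID.
database = {"student_42": [0.1, 0.9, 0.3]}

camera_same = [0.1, 0.9, 0.7]    # made-up encoding at distance 0.4 from the ID
camera_other = [0.9, 0.1, 0.3]   # made-up encoding at distance about 1.13

let_in = verify(database["student_42"], camera_same)     # True
rejected = verify(database["student_42"], camera_other)  # False
```

Only the encoding tied to the swiped ID is compared, so adding a new student means adding one vector to the database, not retraining anything.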
Okay. Now talking about the loss and the training to figure out
if this vector corresponds to something meaningful.
First, we need more data because we need our model to understand
in general the features of the face, and a university that
has 1,000 students is probably not going to have enough; 1,000 images are not enough
to push a model to understand all the features of the face.
Instead we will go online find open datasets with millions of pictures of faces
and help the model learn from
these faces to then use it inside the facility. There was a question in the back.
Why couldn't [inaudible] work out like we did with the, like the-
[inaudible] but every student is uh, one?
That's another option. So the question is why can't we just use the one-hot encoding.
We could build a classifier that has n output neurons,
n corresponding to the number of students in
the school and you take an image, you
run it through the network, and it's going to tell you which student it is.
What's the issue with that?
Every year students enter the school you will have to modify your network every
year because you have more students and you need a higher output vector,
a larger output vector.
You- we don't wanna retrain all the time our networks.
Okay, so what's- what,
what we really want, if,
if we wanna put it in words,
is that's uh, oh, there's a mistake here.
What we really want is,
if I give you two pictures of the same person,
I want a similar encoding,
I want the vector to be similar.
If I give you two pictures of different persons,
I want different encodings,
I want the vector to be very different and we are going to
rely on these two assumptions and these two thoughts in order to generate uh,
our loss function by giving it triplets,
triplets means three pictures: one that we call anchor,
that is the person, a person,
one that we call positive,
that is the same person as the anchor but a different picture of
that person and the third one that we call negative,
that is a picture of someone else.
And now what we wanna do is to minimize the encoding distance between the anchor and the
positive and maximize the encoding distance between the anchor of, and the negative.
Do these two thoughts make sense?
So now my question for you is,
what should be the loss function?
What should be the loss function,
so please go on menti and enter the code and there are three options here A,
B and C. Choose which of
these you think should be the right loss function to use for this problem.
Uh, you have it on your phone as well. Yeah,
it's small on the screen, but you can see it on your phone. Is it cut off?
It's better here? [NOISE]
We can't see the URL [inaudible]. It's too small.
[NOISE]
845709, can you see it on your phone?
So by Enc of A,
I mean the encoding vector of the anchor,
by Enc of P, I mean the encoding vector of
the positive image after you run them through the network.
[NOISE]
Okay 30 more seconds.
[NOISE]
Okay.
I- 20 more seconds.
Okay let's see what we have.
Okay. So, two-thirds of the people think
that it's the first answer, A. So
I'll read it for everyone:
the loss is equal to the L2 distance between the encoding of A and the encoding of
P minus the L2 distance between the encoding of A and the encoding of N. So,
someone who has answered this,
do you wanna give uh, an explanation? Yes.
We're trying to minimize the distance between
A and the positive, and we're trying to maximize the
distance between A and the negative, so we
subtract, so the [inaudible].
Yes, that's correct.
So I'll repeat what you said [NOISE] for [inaudible] students.
We wanna maximize the distance between the encoding of A and the encoding of the negative,
that's why we have the minus sign here:
because we want the loss to go down, we put a minus sign in front of the term we
want to maximize. On the other hand,
we wanna minimize the other term because it's a positive term.
Okay, so I agree with the answer.
Okay, that was the first time you use this tool,
it's gonna be quicker next time.
Okay, so we have
figured out what the loss function should be. Now think about it.
Now that we designed our loss function,
we're able to use an optimization algorithm,
run three images through the network, like that,
get three outputs: encoding of A,
encoding of P, encoding of N,
compute the loss, take the gradient of the loss
and update the parameters in order to minimize the loss.
Hopefully after doing that many times we would get an encoding that
represents features of the face because
the network will have to figure out who are the same people,
who are different people.
Does it make sense? This is called the triplet loss.
And I cheated a little bit in the,
in the quiz, I didn't write this alpha.
The true loss function contains a small alpha, you know why?
Yes?
So we don't have negative loss?
[NOISE] Yeah that- that's not exactly the role of the alpha,
in order to not have negative loss what,
what you can do is to use a maximum of the loss and zero and train on
the maximum of the loss and zero but there is another reason why we have this alpha.
Yes?
[inaudible] to have uh, difference between like false
negative and false positive like which one do you prefer?
Which one do you prefer between false negatives and false positives?
No, it's not about that.
So sometimes you have an alpha in a loss function to put a weight on
some classes, but this is an additive alpha,
it's not a multiplicative alpha.
So, it has nothing to do with that. Yeah?
To penalize large weight.
To penalize [NOISE] large weight,
so you're talking about [NOISE] penalization.
If we had weights in
this formula next to the alpha like alpha times the norm of the weights,
this would be regularization,
but here this term doesn't penalize weight.
[inaudible].
It's not gonna affect the gradient,
it's not gonna affect the weights,
but the reason we have it here is this: let's say the encoding function is
just the zero function, so every image gets the encoding zero.
What we are going to have is that the first term is the distance of
zero minus zero, and the second term is also the distance of zero minus
zero, and so we will have basically a perfect loss of zero,
and we still didn't train our network,
we just learned the null function.
So this alpha is called the margin, and it pushes
your network to learn something meaningful instead of
stabilizing on zeros. Okay?
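As a minimal sketch of the triplet loss described here, the snippet below combines the two distance terms, the margin alpha, and the maximum with zero mentioned earlier. The squared L2 distance and the value alpha = 0.2 are assumptions chosen for illustration; the lecture only says "L2 distance" and "a small alpha".

```python
import numpy as np

def triplet_loss(enc_a, enc_p, enc_n, alpha=0.2):
    """Triplet loss for one (anchor, positive, negative) triplet.

    alpha is the margin; 0.2 is an illustrative value, not one given in the lecture.
    """
    pos_dist = np.sum((enc_a - enc_p) ** 2)  # anchor-positive distance: want it small
    neg_dist = np.sum((enc_a - enc_n) ** 2)  # anchor-negative distance: want it large
    # Clamp at zero so the loss never becomes negative.
    return max(pos_dist - neg_dist + alpha, 0.0)

# A degenerate encoder that maps every image to zero no longer gets a perfect loss:
zeros = np.zeros(4)
degenerate_loss = triplet_loss(zeros, zeros, zeros)

# An encoder that separates the negative by more than the margin reaches zero loss:
good_loss = triplet_loss(np.array([1.0, 0.0]), np.array([1.0, 0.1]), np.array([-1.0, 0.0]))
```

With the margin, the all-zero encoder is left with a loss of alpha, so it is no longer a trivial minimum.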
[NOISE] [inaudible]?
Yeah, so it also
has to do with the initializations but because we didn't talk about
initialization yet we only saw zero initialization,
I think in concentration to- together.
Another way
to keep the network from
becoming stable on zero is to change
the initialization scheme, and in two weeks we're
going to see different initialization schemes together.
[NOISE] Yeah?
[inaudible]. [NOISE]
So, the question is how do we know that
this network is going to be robust to rotations of the image,
or scaling of the image,
or translation of the image?
We know it because in the dataset,
we are going to give, let's say, your picture and your picture scaled,
and we're going to tell the network this is the same person.
So, the network will have to learn that a different scale doesn't mean it's not the same person.
It has to learn this feature. Okay. One more question
and then we move on. In front, yes.
So why is staying at zero a problem?
Can't we just allow a negative loss value?
Yeah. That's a good question. Why is it a problem
to stabilize at zero?
It's because it's common to keep the loss function positive,
and in the paper that you can find, this FaceNet paper,
they don't train exactly this loss,
they train the maximum of this loss and zero.
Yeah. Okay. So you train and you get the right function.
Now, let's make the problem a little more complicated.
What we did so far was face verification,
we're going to do face recognition.
What's the difference? The difference is there is no more ID.
So now you just have a camera in the facility,
you enter, the camera looks at you and finds you.
How would you design this new network?
Yes, in the back.
[inaudible] you've added in an element now of recognition as well,
because now before you'd sort of stand in
front of it and it knew that every picture had a face,
now it needs to detect the face.
Okay. So you're saying maybe we need to add an element to
the pipeline that is a detection element.
That's true in general for face recognition.
Uh, let's say you have a picture that is quite big.
You want to use a first network that finds the face
in the picture, detects it,
and then you crop the face and give it to another network. That's true.
That could also be used in verification as well. Yeah.
[inaudible] because they are
taking more and more time to go through all the faces in your database.
Great. So the difference, maybe, with what you're saying is that
we can reuse the verification algorithm that we've trained.
But instead of a one-to-one comparison, we do a one-to-N comparison.
So we have the pictures of all the students in the database.
What we can do is run all these database pictures in the model,
get a vector that represents them,
right? We get the vectors.
Now, you enter the facility,
we get your picture, we run it through the model,
we get your vector and we can compare this vector to
all the vectors in the database to identify you.
What's the complexity of this?
It's the number of students.
For every prediction, you have to go over the whole database.
And a common model that you can use to do that is K-Nearest Neighbors.
So, of course, if you have only one picture per student,
it's not going to be very precise.
But if you collect three pictures per student and you
run a two-nearest-neighbors algorithm,
it will decide that if the two closest pictures are of the same student, it's
likely that this person is that student.
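The one-to-N comparison can be sketched roughly as follows. This is only an illustration: the 2-D encodings, the student names, and the `identify` helper are made up for the example, and a real system would use the encodings produced by the trained network.

```python
import numpy as np
from collections import Counter

def identify(query, db_vectors, db_names, k=2):
    """Return the most common name among the k stored encodings closest to the query."""
    # One distance per stored picture: the cost grows linearly with the database size.
    dists = np.linalg.norm(db_vectors - query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(db_names[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical database: three encodings per student.
db_vectors = np.array([
    [0.0, 0.1], [0.1, 0.0], [0.0, 0.0],   # student_a
    [1.0, 1.0], [0.9, 1.0], [1.0, 0.9],   # student_b
])
db_names = ["student_a"] * 3 + ["student_b"] * 3

who = identify(np.array([0.95, 0.95]), db_vectors, db_names, k=2)
```

Every prediction scans all stored vectors, which is the "number of students" complexity mentioned above.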
Okay? Now, let's make it a little more complicated.
You probably saw that on your,
on your phones, uh,
sometimes you take a picture and it recognizes that it's,
uh, your grandmother or your grandfather or your mother and father.
Uh, what's happening behind is that there's some clustering happening.
It means we have a bunch of images and we wanna cluster them together.
So this is another algorithm that you see in CS229 and CS229A,
which is the K-Means algorithm.
It is a clustering algorithm that works on
all the vectors that we have in the database.
Let's say
you have your phone, and
you have thousands of pictures of, let's say, 20 different people.
What you want is to put all the pictures of each person in their own cluster.
What you will do is encode all the pictures into vectors,
and then you will run a clustering algorithm like
K-means in order to cluster those into groups.
These are the vectors that look like each other,
these are the vectors that look like each other.
Okay? And then you can simply give
folders to the users with all the pictures of your mom,
all the pictures of your dad and so on.
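A small sketch of clustering encodings with K-means is given below. The 2-D vectors stand in for real face encodings, and the initialization (k evenly spaced rows) is a simplification chosen only to keep the example deterministic; random or k-means++ initialization is more common in practice.

```python
import numpy as np

def kmeans(vectors, k, n_iters=20):
    """Plain K-means: alternate between assigning points and moving centroids."""
    # Simplified deterministic init: k evenly spaced rows of the data (an assumption).
    init_idx = np.linspace(0, len(vectors) - 1, k).astype(int)
    centroids = vectors[init_idx].astype(float)
    for _ in range(n_iters):
        # Distance from every vector to every centroid, shape (n_vectors, k).
        dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = vectors[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Hypothetical encodings of six photos showing two different people.
encodings = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])
labels, centroids = kmeans(encodings, k=2)
```

Each label can then be used as a folder index: all photos with the same label go into the same folder.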
How to, uh, define the K in this case.
Sometimes like obviously all the people [inaudible].
Good question. How- how do you define the K?
So someone has an idea actually.
[inaudible].
Yeah. So one way is, as you said,
to try different values,
train a clustering algorithm, and look at how small a certain loss you defined is.
There's actually an algorithm called X-means
that is used
to find the K; you might search for that if you want.
There is also a method called the Elbow Method that you might want to
search for as well to figure out the K. Okay.
And, as you said, maybe we need just to detect
the face first and then crop and give it to the algorithm.
One more question on face verification and recognition.
So would you also use the,
like factor of [inaudible].
Sorry, can you- can you repeat louder?
Do you also need to use that vector that you trained for [inaudible]?
Do you need to use the vector that you trained for classification?
Um, sorry, I do, I do not understand.
So you mean could-
Yeah. So is the vector after you've changed the [inaudible]?
Oh, so where is the encoding coming from?
That's what you mean in, in the network?
Yeah.
Okay. Good question.
So you have a deep network and you want to
decide where should you take the encoding from.
In this case, the more complex the task,
the deeper you would go.
But for face verification,
what you want and you know it as a human,
you want to know features like, uh,
distance between eyes, nose and stuff,
and so you have to go deeper.
You need the first layer to figure out the edges
and give the edges to the second layer,
the second layer to figure out the nose, the eyes,
and give them to the third layer,
the third layer to figure out the distance between the eyes,
the distance between the ears.
So you would go deeper and get the encoding
deeper because you know that you want high level features.
Okay. Art generation: given a picture, make it look beautiful.
As usual, data. What do we need?
A little complicated because we have to define what beautiful is.
[NOISE] So data some beautiful pictures?
I don't know, maybe my concept of beautiful is different than yours.
[NOISE] A certain style that we want.
Data in the certain style that we want. That's a good point.
So we might say that beautiful means paintings,
like paintings are usually beautiful.
So you want to have a, that kind of a style. Yeah, that's true.
So let's say we have any data that we, we want.
What we're going to do and the way we define this problem
is let's take an image that we call the content image,
and here again you have the Louvre Museum.
And let's take an image that we call the style image,
and this is a painting that we find beautiful.
What we want is to generate an image
that has the content of the content image,
but painted by the painter of the style image.
So this style image is by Claude Monet, and here we have the Louvre painted by Claude Monet,
even though, uh, he was dead when this pyramid was built.
So that's our goal and this is what we would call art generation.
There are other methods, but this is one.
So how do we do that?
What architectures do we need?
And please try to use what we've seen in the past two applications together.
[NOISE] What training scheme, what application,
what, what architecture
[NOISE].
No one wants to try?
Yes.
[inaudible]
Yeah.
[inaudible]
So you're saying we,
we give- we take some style images,
we give it as input to a network and the network outputs yes or no,
like one or zero?
So do we want to generate?
We wanna generate an image, yes.
Okay. So given [inaudible].
Okay. Yes, probably.
So what you're proposing is we get an image,
that is the content image,
and we have a network that is the style network,
which will style this image, and
we will get a styled version of the content.
Right. So it will take certain features of that style and use this to change the output.
[NOISE] Yeah. So we use certain features of
the style and change this style according to what the network is.
So this is actually done.
This is one method. That's not the one we will see today.
But [NOISE], uh, the issue with this method, which is a small issue,
is that you have to train your network to learn one style.
The network learns one style: you give the content,
it gives you the content with the specific style of the model.
What we want is a model that is not restricted to a specific style.
I wanna be able to give a painting of Picasso and get this picture painted by Picasso.
So the difference here is that we're not,
we're not going to learn parameters of a network like we did
for face verification or for day and night classification.
We're going to learn an image.
So, you remember when we talked about
backpropagation of the gradient to the parameters? We're not going to do that.
We're going to backpropagate all the way back to the image. Let's see how it works.
So, first, we have to understand what content means and what's style means.
To do that, we're going to use encoding.
We're going to
use the ideas that we talked about earlier.
Giving the content image to a network that is very good
will allow us to extract some information about the content of this image.
We specifically saw together that earlier layers will detect the edges.
The edges are usually a good representation of the content of the image.
So I might have a very good network,
give my content image,
extract the information from the first layer,
this information is going to be the content of the image.
Now, the question is how do I get the style?
I wanna give my style image and find a way to extract the style.
That's what we're going to learn later in this course,
it's a technique called the Gram matrix.
And the important thing to remember is that,
the style is non-localized information.
If I show you, uh, the,
the pictures in the previous slide, oh sorry, here,
you see that in the generated picture,
although on the style image there was a tree on the left side,
there's no tree on the generated image.
It means when I extracted the style,
I just extracted non-localized information.
What's the technique that Claude Monet has used to paint?
I didn't want to extract the tree that was on the style image. I don't want the content.
Okay. So we're going to take a network that understands images very well,
and they're common online.
You can find ImageNet classification networks online
that were trained to recognize thousands of objects.
This network is going to understand basically anything you give it.
If I give it the Louvre Museum, it's going to find all the edges very easily,
it's going to figure out that it's during the day,
it's going to figure out there are buildings on the sides, and all the features of
the image, because it was trained for months on thousands of classes.
Let's say we have this network,
we give our content image to it and we extract information from the first few layers.
This information we call content C,
the content of the content image. Does that make sense?
Now, I give the style image and I will use
another method that is called the Gram matrix to extract style S,
the style of the style image, okay?
And now the question is: what should be the loss function?
So let's go on Menti.
So same code as usual, just open it.
I can repeat the code if you want,
845709, and these are the three proposals for the loss function.
So reminder, content C means content of the content image,
style S means style of the style image,
style G means style of the generated image,
content G means content of the generated image.
Take like a minute.
It's too small?
Oh, the code, up, 845709.
So why do we need to have an image in that class?
You don't actually need to classify an image on [inaudible]. So why do you need to use ImageNet [inaudible]?
Why- so just repeating the question,
why do we need to use ImageNet?
Because we don't really need to classify an image and it's going to waste time.
Uh, the reason we use an ImageNet network is because it understands pictures.
So if you give the content image
to a network that doesn't understand pictures very well,
you're not going to get the edges very well. So you want a network-.
I don't care about the classification of the pictures.
You don't care about the classification output,
you just cut the network in the middle,
extract the layers in the middle.
Okay. Let's see what the answers are according to you guys.
So if we are getting style- style of it, you are not training anything, right?
So yeah, I repeat, we're not training anything here.
We're getting a model that exists and we use this model.
But we are going to talk about the training after.
Okay. Someone who has answered the second
answer? I will read it out loud:
the loss is the L2 distance between the style of
the style image and the style of the generated image, plus
the L2 distance between the content of the generated image and the content of the content image. Yeah.
We want to maximize both the [inaudible]. [NOISE]
So yeah, we wanna minimize both terms here.
So we want the content of the content image to
look like the content of the generated image,
so we wanna minimize the L2 distance of these two.
And the reason we use a plus is because we also wanna
minimize the difference of styles between the generated and the style image.
So you see, we don't have any term that says the style of the content image
minus the style of the generated image should be minimized. This is the loss we want.
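The loss just described can be written as a short sketch. The four inputs are assumed to be arrays already extracted from the network; using squared L2 distances with equal weight on both terms is an assumption for illustration, since the lecture only specifies "L2 distance" and a plus sign.

```python
import numpy as np

def style_transfer_loss(content_c, content_g, style_s, style_g):
    """Style term plus content term; both should go down during optimization."""
    # Content of the generated image should match the content of the content image.
    content_term = np.sum((content_g - content_c) ** 2)
    # Style of the generated image should match the style of the style image.
    style_term = np.sum((style_g - style_s) ** 2)
    return style_term + content_term

# Toy values standing in for extracted features.
content_c = np.array([1.0, 2.0])
style_s = np.array([0.0, 1.0])
perfect = style_transfer_loss(content_c, content_c, style_s, style_s)
mismatch = style_transfer_loss(content_c, np.array([2.0, 2.0]), style_s, np.array([0.0, 4.0]))
```

Note that the style of the content image and the content of the style image never appear in the loss.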
Okay, up now.
Okay. So just going over the architecture again.
So the loss function we're going to use will be the one we saw.
And so one thing that I want to emphasize here is we're not training the network,
there is no parameter that we train.
The parameters are in the ImageNet classification network,
we use them we don't train them.
What we will train is the image.
So you get an image and you start with white noise,
you run this image through
the classification network but you don't care about the classification of this image.
The ImageNet network is going to give a random class to this image, totally random.
Um, instead, you will extract content G and style G, okay?
So from this image,
you run it and you extract information
from this network using the same techniques that you've
used to extract content C and style S. Content C and style S, you already have.
You're able to compute the loss function because now
you have the four terms of the loss function.
You compute the derivatives,
instead of stopping in the network,
you go all the way back to the pixels of the image and you
decide how much should I move the pixels in order to make this loss go down,
and you do that many times.
You do that many times. And the more you do that,
the more this is going to look like the content of the content image,
and the style of the style image. Yeah, one question.
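To make "train the image, not the network" concrete, here is a toy sketch. A fixed random matrix `W` stands in for the frozen pretrained network (a strong simplification), only the content term is used, and the gradient with respect to the pixels is written by hand. The sizes, learning rate, and step count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "network": a fixed linear feature extractor. It is never updated.
W = rng.normal(size=(8, 16))

content_image = rng.normal(size=16)
content_c = W @ content_image          # content C, extracted once

image = rng.normal(size=16)            # generated image starts as white noise
initial_loss = np.sum((W @ image - content_c) ** 2)

learning_rate = 0.005
for _ in range(500):
    residual = W @ image - content_c   # content G minus content C
    grad_image = 2 * W.T @ residual    # gradient of the loss w.r.t. the pixels
    image -= learning_rate * grad_image  # update the pixels, not W

final_loss = np.sum((W @ image - content_c) ** 2)
```

The same loop has to be rerun for every new pair of content and style images, which is the cost of this method's flexibility.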
So for each new example of content and style images
you need to do a new training like this?
Yeah. So the downside of this method
is that, although it has the flexibility to work with any style
and any content, every time you wanna generate an image you have to do this training loop.
Whereas the other network that you talked about
doesn't need that, because the model is trained
to convert the content to a style:
you just give it the content and that's it.
Do you have to train the network on many,
kind of like Monet images or you only need to do those kind of like Monet?
Which network you talk about, this network?
Yes.
Yeah. So do we need to train this network on Monet images? Usually not.
This network is trained on millions of images.
It's basically seen everything you can imagine. Yeah.
So you only need to give one art piece to it and then it will
be able to back-propagate properly into any [inaudible].
Uh, what do you mean by back-propagate properly?
Here you're not training the network.
You are taking this image,
computing the backpropagation and going all the way back to the image,
only updating the image. You don't update the network.
Where does the rps-?
It comes from content C and style S.
So,
the baseline of the loss function is that you have content C and
style S, because you've chosen a content picture and a style picture. Then,
at every step, you find the new content G and style G. You backpropagate, update the image,
[NOISE] give it again, get the new content G and style G,
update again, and so on.
[NOISE] No, the art image is used just one time.
The art image goes through the neural network just one time:
you extract style S and then that's all, you don't use it again.
Okay let's do one more question here.
Why do you start with white noise instead of the content or the style?
Good question. Why do you start with white noise instead of the content or the style?
Actually do you think it's better to start with the content or the style?
Probably the style.
Probably the style? I think probably the content, because
the edges at least look like the content, so it is going to help
you converge quicker.
Yeah, that's true, you don't have to start with white noise.
In general the baseline is to start with white noise so that anything can happen.
If you give it the content to start with, it is going to have a bias
towards the content, but if you train longer.
Okay one more question and then we can move on.
So this style and content [inaudible].
ImageNet doesn't understand what's content and style but
ImageNet finds the edges on the image and so you can
give the content image and extract the few first layers to get
information about them because when it was trained on classification,
it needed to find the edges.
To find that a dog is a dog,
you first need to find the edges of the dog so it's,
it's trained to do so and for the style,
it's complicated to understand the style, but the network finds all the features on
the image, and then we use a post-processing technique that is
called the Gram matrix in order to extract what we call style.
It's basically a cross-correlation of all the features of the network.
We will learn it together later on.
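As a rough preview, a Gram matrix over one layer's activations can be sketched like this. The layout `(channels, height, width)` and the lack of any normalization are assumptions; implementations often divide by the number of entries.

```python
import numpy as np

def gram_matrix(features):
    """Correlate every channel with every other channel, summing over all positions.

    features: activations of one layer, assumed shape (channels, height, width).
    """
    channels, height, width = features.shape
    flat = features.reshape(channels, height * width)
    return flat @ flat.T  # shape (channels, channels)

# Random activations standing in for a real layer output.
features = np.random.default_rng(0).normal(size=(3, 4, 5))
gram = gram_matrix(features)

# Shifting the feature maps spatially does not change the result.
shifted_gram = gram_matrix(np.roll(features, 2, axis=2))
```

Because the sum runs over every spatial position, the result does not record where a feature occurred, which matches the point above that style is non-localized information.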
Okay, let's move on to the next application because we don't have too much time.
So this is the one I prefer:
given a 10-second audio clip of speech, detect the word "activate".
So you know we talked about trigger word detection;
there are many companies that have this wake word thing where you have
a device at home and when you say a certain word it activates itself.
So here it is the same thing for the word "activate".
What data do we need? Do we need a lot or not?
Probably a lot because there are many accents and
one thing that is counter-intuitive is that,
if two humans, let's say
two women, speak, as a human you would say these voices are
pretty similar, right?
You can detect the word.
What the network sees is a list of numbers that are totally different from
one person to another because
the frequencies we use in our voices are totally different from each other.
So the numbers are very different although as a human we feel that it's very similar.
So we need a lot of 10-second audio clips, that's it.
What should be the distribution?
It should contain as many accents as you can, as many
female and male voices, kids and adults, and so on.
What should be the input of the network?
It should be a 10 second audio clip that we can represent like that.
The 10 second audio clip is going to contain some positive words, in green.
Positive word is activate and it's also going to
contain negative words in pink like kitchen,
lion, whatever, words that are not activate and we want only to detect the positive word.
What should be the sample rate?
Again, same question: you would test on humans, and
you would also talk to an expert in speech recognition to
know what's the best sample rate to use for speech processing.
What should be the output?
Any ideas?
[NOISE]
Okay, yeah any other?
Classification, yes no.
Classification, yes no. So zero or one.
Actually let's make a test, let- let's do a test.
So we have three audio clips here,
speech one, speech two, speech three.
I don't know if we have the sound here. Do we have the sound?
[NOISE] Maybe we have it now, okay let's try.
[FOREIGN]
So this is
labeled one. [LAUGHTER] Nobody speaks Italian in
the room? Second one.
[FOREIGN]
Okay what's the wake word?
Has anybody found what was the, the trigger word?
We need more data.
We need more.
[LAUGHTER]
So you know what's funny is,
to be this is the right scheme to label,
like it's definitely possible but it seems that
even for humans this labeling scheme is super hard.
We're not able to find what's happening.
Like, I don't know,
even though I made this slide, I don't even remember. No kidding.
Now let's try something else, Okay?
So now we have a different labeling scheme that
tells us also where the wake word is happening.
Let's hear it again.
[FOREIGN]
Okay, what's the trigger word?
Pomeriggio.
Pomeriggio means uh, afternoon in Italian.
Okay. So you see, what
I am trying to illustrate is:
compare the human to the computer and you
will get the right labeling scheme to use. And of
course the labeling scheme here is going to be better
for the model than the first one, and we just proved it.
Uh, the, the important thing is to know that the first one would also work,
we just need a ton of data.
We need a lot more data to make the first labeling scheme work
than we need for the second one, does that make sense?
So yeah, we will use something like that.
[inaudible] . [NOISE]
Good question, actually this is not the best labeling scheme.
As you said, should the one come before or after the word was said?
What do you guys think? Before?
After.
After, yeah. You will see that
recurrent neural networks are basically going to look at
the data just as humans do,
temporally, from the beginning to the end.
In this case you need to hear the word in order to detect it,
so you're going to put the one right after the word was said.
Another issue that we have with this is that there are too many zeros,
it's highly unbalanced so the network is pushed to always predict zeros.
So what we do as a hack,
and there are a lot of hacks like that in papers if you read them,
is add several ones after the word was said.
I would add 20 ones, basically, okay?
So this is our labeling scheme now.
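This labeling scheme can be sketched in a few lines. The helper name `make_labels` and the idea of passing the index of the last time step of each positive word are assumptions for the example; the count of 20 ones comes from the lecture.

```python
import numpy as np

def make_labels(n_steps, word_end_steps, n_ones=20):
    """Per-time-step labels: zeros everywhere, then n_ones ones right after each positive word.

    word_end_steps: index of the last time step of each occurrence of the trigger word.
    """
    labels = np.zeros(n_steps, dtype=int)
    for end in word_end_steps:
        # Slicing past the end of the array is clipped automatically.
        labels[end + 1 : end + 1 + n_ones] = 1
    return labels

labels = make_labels(n_steps=100, word_end_steps=[30])
```

Here the trigger word ends at step 30, so steps 31 through 50 are labeled one and everything else stays zero.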
What should be the last activation of our network?
[NOISE] Sigmoid function,
yeah, sigmoid but sequential.
For every time step you would use a sigmoid to output zero or one, basically.
Don't worry if you don't understand specifically what networks we're using,
you're going to learn it in a few weeks.
So the architecture should
be like a recurrent neural network, probably.
Uh, convolutional neural networks might work as well,
we'll see it later on in the course. And
the loss function should be the same as before, but we should make it sequential.
For every time step we should use a loss function like that, and we should sum them
over all the time steps. Sounds good?
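A sketch of such a sequential loss is shown below, assuming that "the same as before" refers to the binary cross-entropy (logistic) loss. The clipping constant `eps` is an implementation detail added here to avoid taking the log of zero.

```python
import numpy as np

def sequence_bce(y_true, y_prob, eps=1e-12):
    """Binary cross-entropy at every time step, summed over the sequence."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    per_step = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return per_step.sum()

# Two time steps where the sigmoid outputs 0.5 each: the loss is 2 * ln(2).
loss = sequence_bce(np.array([0, 1]), np.array([0.5, 0.5]))
```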
So, another insight on these projects (I'll take it
after) is what was critical to the success of this project.
I think there are two things that are really
critical when you build such a project.
The first one is to have a strategic data acquisition pipeline,
so let's talk more about that.
We said that our data should be 10-second audio clips
that contain positive and negative words from many different accents.
How would you collect this data?
[NOISE]
That's right. [NOISE]
You say you pay people to give you 10 seconds of their voice?
[LAUGHTER]
[inaudible] I think you,
you can take your phone,
go around campus and that's actually how we did it,
we took our phones, we went around campus and we got some audio recordings.
So one way to do it is
to go and get 10-second audio recordings from different
people with a large distribution of accents, and then what do you do?
You label? You label by hand?
That's one method, is it long or short?
Is it quick or not? It's super slow, yeah.
[inaudible]
Oh, subtitles in movies.
Uh, that's a good idea actually.
You could, depending on the licensing of the movie,
[LAUGHTER]
take the audio from a movie, get the subtitles, and look for "activate".
And every time the subtitles say
"Activate", you could label your data.
That's super fun. That's super good actually.
You could label automatically using that.
Yeah. So, that's a good idea.
I think there's another way to do it that is closer to
that which is we're going to collect three databases.
The first one is going to be the positive word database,
the second one is going to be the negative word database,
the third one is going to be the background noise database.
So, I take the background, 10 seconds.
I insert randomly from one to three negative words and I insert
randomly from one to three positive words
making sure it doesn't overlap with a negative word.
Okay? What's the main advantage of this method?
Programmatic generation of samples.
Yeah, programmatic generation of samples and automated labeling.
I can label. I know where I inserted my positive words.
[NOISE] So, I just add ones where I inserted it.
I can generate millions of data examples like that just
because I found the right strategy to, to create data.
You see the difference between the two methods.
The one where you have to go out and collect data and the one where you just go out,
collect positive words, negative words,
and then find background noise on YouTube or wherever you have the right license to use.
It's, it's a big difference and this can make,
[NOISE] can make your company succeed compared to another company.
It's very common. All right.
So, I would go on campus,
take one second audio clips of positive words,
put it in the database in green.
Take one second audio clips of negative words of the same people as well,
put it in the pink database, and get background noise from anywhere I
can find it. It's very cheap. And then create the synthetic data, label it automatically.
And you know, with like five positive words, five negative words,
five backgrounds, you can create a lot of data points.
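The synthetic data generation described above could look roughly like this. It is a sketch under several assumptions: clips are plain NumPy arrays, each array element is treated as one time step for labeling, overlaying is simple addition, and the `synthesize` helper and its retry limit are hypothetical.

```python
import numpy as np

def synthesize(background, positives, negatives, rng, n_ones=20, max_tries=100):
    """Overlay word clips on a background at random non-overlapping positions.

    Returns the mixed audio and labels with n_ones ones right after each positive word.
    """
    audio = background.copy()
    labels = np.zeros(len(background), dtype=int)
    used = []  # (start, end) segments already occupied by a word

    def place(clip):
        for _ in range(max_tries):
            start = int(rng.integers(0, len(audio) - len(clip) + 1))
            end = start + len(clip)
            if all(end <= s or start >= e for s, e in used):
                audio[start:end] += clip
                used.append((start, end))
                return end
        return None  # no free slot found: skip this clip

    for clip in negatives:
        place(clip)
    for clip in positives:
        end = place(clip)
        if end is not None:
            labels[end : end + n_ones] = 1  # we know where the word is, so labeling is free
    return audio, labels

rng = np.random.default_rng(0)
background = np.zeros(1000)
positive_clips = [np.full(50, 1.0)]
negative_clips = [np.full(50, 2.0)]
audio, labels = synthesize(background, positive_clips, negative_clips, rng)
```

Calling the function repeatedly with different random draws of backgrounds and words produces as many labeled examples as needed.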
Okay. So, this is
an important technique that you might want to think about in your projects.
The second thing that is important for the success of
such a project is the architecture search and hyperparameter tuning.
So, all of you will have complicated projects where you
will be lost regarding the architecture to use at first.
It's a complicated process to find the architecture but you, you should not give up.
And the first thing I would say is talk to the experts.
So, let me tell you the story of this project.
Um, first I, I started
like looking at the literature
and figuring out what network I could use for this project.
And I ended up using this for the first part:
I used a Fourier transform to extract features from the speech.
Who's familiar with spectrograms or Fourier transforms?
So, for the others, think about audio speech as a 1D signal.
Every 1D signal can be decomposed into a sum of sines and
cosines with a specific frequency and amplitude for each of these.
And so, I can convert a 1D signal into a matrix
with basically [NOISE] one axis that is the frequency,
one axis that is the time,
going from 0 to 10 seconds.
And I will get the value of
the amplitude of each frequency.
So, maybe this one is a strong frequency,
this one is a strong frequency,
this one is a low one and so on,
for every time step. This is the spectrogram of an audio speech.
You're going to learn a little bit more about that.
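
To make the spectrogram idea concrete, here is a minimal sketch that slides a window over a 1D signal and computes the amplitude of each frequency in every window with a naive discrete Fourier transform. The window size, hop and sample rate are illustrative assumptions, not the hyperparameters used in the project, and a real pipeline would use an FFT library instead of this slow loop.

```python
import cmath
import math


def spectrogram(signal, window_size=64, hop=32):
    """Return one list of frequency amplitudes per time step."""
    frames = []
    for start in range(0, len(signal) - window_size + 1, hop):
        window = signal[start:start + window_size]
        frame = []
        # Only the first half of the bins is needed for a real signal.
        for k in range(window_size // 2 + 1):
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / window_size)
                        for n, x in enumerate(window))
            frame.append(abs(coeff))
        frames.append(frame)
    return frames


sample_rate = 640  # hypothetical, chosen so each bin is 10 Hz wide
tone = [math.sin(2 * math.pi * 50 * t / sample_rate)
        for t in range(sample_rate)]  # one second of a 50 Hz tone
spec = spectrogram(tone)
strongest_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
strongest_hz = strongest_bin * sample_rate / 64  # bin index to frequency
```

In this toy case the strong frequency in every time step is the bin that corresponds to the 50 Hz tone, which is exactly the kind of structure the network gets to see in the matrix.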
So, after I got the spectrogram which is better than the 1D signal for the network,
I would use an LSTM which is a recurrent neural network and add
a sigmoid layer after it to get probabilities between zero and one.
I will threshold them: everything more than
0.5 I will consider a one, everything less a zero.
I tried for a long time fitting this network on the data; it didn't work.
But one day I was working on campus and
I found a friend who was an expert in speech recognition.
He's worked a lot on all these problems
and he knew exactly that this was not going to work.
He could have told me.
So, he told me, "There's several issues with this network.
The first one is your hyperparameters in the Fourier transform, they're wrong.
Go on my GitHub, you will find what hyperparameters I used for this Fourier transform.
You will find specifically what sample rate,
what window size, what frequencies I used."
So, that was better. Then he said,
"One issue is that your recurrent neural network is too big.
It's super hard to train. Instead, you should reduce it."
So,
he told me to use a convolution to reduce the number of time steps of my audio clip.
You will learn about all these layers later.
Ah, and also to use batch norm, which is a specific type of layer
that makes the training easier.
And finally, you get your sigmoid layer and you output zeros and ones.
But because the number of output time steps is smaller than the input's, you have to expand it.
So, you need an expansion algorithm,
just a script that expands, let's say, every zero into two zeros,
every one into two ones and so on.
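
The thresholding and expansion steps are simple enough to write down. This is a minimal sketch, assuming the convolution halves the number of time steps so the expansion factor is two; the function names are hypothetical.

```python
def threshold(probabilities, cutoff=0.5):
    """Turn sigmoid outputs into hard zeros and ones."""
    return [1 if p > cutoff else 0 for p in probabilities]


def expand(labels, factor=2):
    """Repeat every label so the output lines up with the longer input."""
    return [label for label in labels for _ in range(factor)]


probabilities = [0.1, 0.8, 0.6, 0.3]
predictions = expand(threshold(probabilities))
# predictions is [0, 0, 1, 1, 1, 1, 0, 0]
```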
And now I get another architecture that I managed to train within a day.
And this was all because I was lucky enough to find
the experts and get advice from this person.
So, I think during your projects you will run into the same problems as I ran into.
The important thing is to spend more time figuring out who's the expert
and who can tell you the answer rather than trying out random things.
I think this is an important thing to think about.
Okay. So, don't give up and also use error analysis, which we are going to see later.
Ah, we have two more minutes.
So, I'm not gonna go over this one.
I'm just going to talk about it quickly.
There is another way to solve wake word detection.
And the other way is to use the triplet loss algorithm.
Instead of using anchor, positive and negative faces,
you can use audio speech of one second.
Anchor is the word activate.
Positive is the word activate said differently and negative is another word.
You will train your network to encode activate as
a certain vector and then compare
the distance between vectors to figure out if activate is present or not.
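
As a rough sketch of this alternative, the triplet loss on three encodings and the distance check at test time could look like the following. The encodings here are made-up two-dimensional vectors, and the margin and distance threshold values are assumptions for illustration only.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two encoding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Pull anchor and positive together, push anchor and negative apart."""
    return max(squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin, 0.0)


def is_activate(encoding, reference, max_distance=0.5):
    """Decide whether a clip's encoding is close enough to 'activate'."""
    return squared_distance(encoding, reference) < max_distance


anchor = [1.0, 0.0]    # encoding of "activate"
positive = [0.9, 0.1]  # encoding of "activate" said differently
negative = [0.0, 1.0]  # encoding of another word
good = triplet_loss(anchor, positive, negative)  # already well separated
bad = triplet_loss(anchor, negative, positive)   # roles swapped on purpose
```

The loss is zero when the positive is already closer to the anchor than the negative by at least the margin, and it grows when the encodings are the wrong way around.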
Okay. We have about two more minutes.
So, I'm going to [NOISE] Oh, sorry.
My bad [LAUGHTER] just on me [LAUGHTER].
Ah, just to finish,
ah, with two more slides.
Ah, now that you've seen some loss functions,
I want to show you another one and I want you to tell me
what application this beautiful loss corresponds to.
This is one of the most beautiful losses I've seen in my life.
[LAUGHTER] So, can someone tell me what's the application,
what problem are we trying to solve if we use this loss function?
Speech recognition.
Speech recognition, no.
It's not the case. Good try. Yes.
Regression.
Regression. That's true.
It's a regression problem but it's a specific regression problem.
Bounding box.
Good. Bounding boxes, object detection.
This is object detection.
So, I put the paper here, you can check it
out. But how do you know that it's object detection?
I've done it before.
Oh, you've done it before.
[LAUGHTER] Okay.
So, this is the loss function of a network called YOLO.
And the reason you can find out
it's bounding boxes is because if you look at the first term,
you will see that it's comparing predicted x to true x,
predicted y to true y.
This is the center of a bounding box, xy.
Second term is W and H. W and H stand for width and height of a bounding box.
And it's trying to minimize the distance between
the true bounding box and the predicted bounding box basically.
The third term has an indicator function with objects.
It's saying, "If there is an object,
you should have a high probability of objectness."
The fourth term is saying that if there is no object,
you should have a lower probability of objectness.
And finally the final term is telling you that you have to find the class that is in this box.
Is it a cat? Is it a dog? Is it an elephant?
Is it whatever? So, this is an object detection loss function.
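
The slide itself is not reproduced in this transcript. For reference, the loss as it is written in the YOLO paper has the five terms just described, in the same order. The notation is the paper's and may differ slightly from the slide: S is the grid size, B the number of boxes per grid cell, C the objectness confidence, p(c) the class probabilities, hats mark predictions, and the lambdas are weighting constants.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
      \mathbb{1}_{ij}^{\text{obj}}
      \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
 &+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
      \mathbb{1}_{ij}^{\text{obj}}
      \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
           + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
 &+ \sum_{i=0}^{S^2} \sum_{j=0}^{B}
      \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
 &+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
      \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
 &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}}
      \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```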
Actually, do you know why you have a square root here?
[NOISE] [inaudible] that.
The reason we have the square root is
because you want to penalize
errors on small bounding boxes more than errors on big bounding boxes.
So, if I give you an image of a human like that and a cat like this,
you can have this. So,
this box, the one inside, is the ground truth,
it's a very tight box.
Same for this one.
So, the other boxes are the predictions and the inner ones are the ground truth.
What is interesting is that a two-pixel error on
this cat is much more
important than a two-pixel error on this human because the box is smaller.
So, that's why you use a square root, to penalize
the errors on small boxes more than on big boxes.
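
A quick numerical check makes the point. The box widths below are made-up numbers for illustration: a 20-pixel-wide cat box and a 200-pixel-wide human box, each predicted two pixels too wide.

```python
import math


def width_error(true_w, pred_w, use_sqrt):
    """Squared error on a box width, with or without the square root."""
    if use_sqrt:
        return (math.sqrt(true_w) - math.sqrt(pred_w)) ** 2
    return (true_w - pred_w) ** 2


# Without the square root, a two-pixel error costs the same on both boxes.
cat_plain = width_error(20, 22, use_sqrt=False)
human_plain = width_error(200, 202, use_sqrt=False)

# With the square root, the same error costs more on the small box.
cat_sqrt = width_error(20, 22, use_sqrt=True)
human_sqrt = width_error(200, 202, use_sqrt=True)
```

Here cat_plain and human_plain are equal, while cat_sqrt comes out larger than human_sqrt, which is the behavior the square root is there to produce.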
Okay. And finally the final slide, okay.
Let's go over that. So, just recalling what we have for next week.
Ah, you have two modules to complete for next Wednesday, ah,
which are C1M3 with the following quiz and the following programming assignments,
and C1M4 with one quiz and two programming assignments.
You're going to build your first deep neural network.
This is all
already on the website and we'll publish the slides now.
Ah, you have TA project mentorship that is mandatory this week.
So, TA project mentorships are mandatory this week
to start, the week before the project proposal,
then after the project proposal,
after the project milestone and before the final project submission.
Okay. And Friday TA sections,
you're going to do some neural style transfer and art generation,
ah, filling the AWS form.
I don't know if it's been done yet.
We're, we're going to try to give you some credits,
ah, for your projects with GPUs.
[NOISE] Okay. Thanks guys.
