Hey guys, so this is—no joke—a list of
all my real life friends, but the problem
is just that- they're kind of ugly.
[ Illuminati Music ]
Since I have such ugly friends
and since computers are so darn powerful
why don't I just have my computer
generate prettier friends?
That means the goal is to have my
computer automatically generate a wide
range of human face images
without any human work required.
One option is to load up a videogame like The Sims
or Nintendo Miis and randomize the settings
of their avatar creators.
But that's not "ma-chine learn-y" enough and
I know-- I just know! that machine learning is
what you guys want. So let's take a look at
convolutional neural networks.
Some of you viewers will already know what these
are, but I want my videos to be as
beginner-friendly as possible, so I'll
assume you know nothing.
Say you have an image
meaning a two-dimensional array of
pixels that are all either black or white.
You want to find out where all the
donut shapes are. How would you go about
doing that? Well, let's make a donut filter
it will specify the requirements for
something to be a donut.
Then we'll center our filter around the upper-left
pixel and ask:
“Is every condition of the filter satisfied?”
No?! Then it's NOT a donut.
We can move the filter over each
pixel, asking the same question. Most
pixels, like this one, will say "no donut,"
since not every condition is satisfied...
...but a select few will say, "yes donut."
After that's all said and done, we now
have markers at the center of where all
the donuts are. Voila! Our goal is complete!
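That scan-and-match step can be sketched in a few lines of Python. The 3x3 filter and the tiny test image here are made up for illustration, not the actual filter from the video:

```python
import numpy as np

# A 3x3 "donut" filter for binary images: the center must be dark (0)
# and all eight surrounding pixels must be light (1).
DONUT = np.array([[1, 1, 1],
                  [1, 0, 1],
                  [1, 1, 1]])

def find_donuts(image):
    """Return (row, col) centers where every filter condition is satisfied."""
    hits = []
    h, w = image.shape
    for r in range(1, h - 1):          # skip the border pixels for now
        for c in range(1, w - 1):
            patch = image[r - 1:r + 2, c - 1:c + 2]
            if np.array_equal(patch, DONUT):   # is every condition satisfied?
                hits.append((r, c))
    return hits

img = np.ones((5, 5), dtype=int)
img[2, 2] = 0                          # one dark pixel in a sea of light: a tiny "donut"
print(find_donuts(img))                # -> [(2, 2)]
```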
However, most images aren't as
simple as black or white. For most images,
each pixel's brightness exists on a
spectrum from 0 to 1, so it could be 0.5
or 0.1 (ignore color for now). So when
we're searching for donuts, we can't use
a filter so simplistic that it
only asks yes-or-no questions. Rather than asking:
”is this a donut, or is this not a donut?”
Our new, improved filter should
instead ask the question:
”How donutty is this pixel?“
On a continuous scale from
negative infinity to positive infinity
with higher values meaning this is more
like a donut and lower values meaning
this is less like a donut. How can we
engineer a filter that does this?
Well, let's imagine the filter is a set of
multipliers—like this—some multipliers
are higher than others
some are positive and some are negative
but let's see what they do.
We can center the filter around a single pixel and
then we can multiply those underlying
image pixel values by those multipliers,
add up all those products, and we get an
overall score of how "donutty" that pixel is.
You can think of the positive
multipliers as if they're saying:
”If you want to be considered donutty,
you'd better have a high value for this pixel.”
And the negative multipliers like they're saying:
”Ooh donutty pixels
don't typically have high values here.”
In the end we can apply this
continuous "donutty" filter to every pixel of the image.
So this pixel with a score of 3.64 is
the most "donutty." Which makes sense
because it's a dark pixel surrounded
by quite a few light pixels.
A few other contenders get pretty close.
Now this pixel has the worst donut score,
which kind of makes sense because it
looks like an inverted donut.
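Here's what that multiply-and-add step looks like in Python. The filter values and the little image are made up for illustration (they're not the exact numbers on screen), but the shape is the same idea: a negative multiplier in the middle, positive ones around it:

```python
import numpy as np

# Hypothetical 3x3 "donutty" filter: donut holes are dark, so the center
# multiplier is negative; the surrounding ring multipliers are positive.
FILTER = np.array([[ 0.5,  1.0,  0.5],
                   [ 1.0, -4.0,  1.0],
                   [ 0.5,  1.0,  0.5]])

def donutty_score(image, r, c):
    """Multiply the 3x3 patch around (r, c) by the filter and sum the products."""
    patch = image[r - 1:r + 2, c - 1:c + 2]
    return float(np.sum(patch * FILTER))

img = np.array([[0.9, 0.8, 0.9],
                [0.8, 0.1, 0.8],
                [0.9, 0.8, 0.9]])
# A dark center surrounded by bright pixels earns a high (positive) score.
print(donutty_score(img, 1, 1))
```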
By the way—if you're curious—there are quite a few
methods to handle the literal edge cases.
You can cut them off,
fill the exterior with zeros,
extend the borders to infinity,
or just loop the image.
For our example, we'll just fill the exterior
with zeros because it's the easiest to understand.
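All four of those edge-handling strategies map onto NumPy's padding modes, so here's a quick sketch (the 2x2 image is just a placeholder):

```python
import numpy as np

img = np.array([[1, 2],
                [3, 4]])

# Three of the edge-handling options, adding one pixel of padding per side:
print(np.pad(img, 1, mode="constant"))  # fill the exterior with zeros
print(np.pad(img, 1, mode="edge"))      # extend the borders outward
print(np.pad(img, 1, mode="wrap"))      # loop the image around

# "Cutting them off" is the fourth option: don't pad at all, and only
# convolve where the filter fully fits, which shrinks the output image.
```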
And also, each application of
a filter is called a convolution which
gives the convolutional neural network its name.
But hold on! The donutty score of
each pixel is a scalar.
Meaning: a number on a one-dimensional number line.
And guess what!? The original
brightness of each pixel was also a scalar.
What does this mean? It means that
applying this continuous donutty filter
converts data of one type into
data of the same type.
In other words:
it converts a grayscale image into another grayscale image.
So if we wanted to, we could apply this filter to
the image once, and then again, and again, and again.
Forever. To be honest that's actually not
very interesting. What is interesting, is
if you apply a different filter in the second layer,
and a different filter in the third layer
and so on-- and also:
If you apply multiple filters to each image,
creating this giant web of filters,
each looking for different things.
Since each filter can be different,
you don't have to be searching for just donuts.
You can have one filter that's good for
finding vertical lines, and maybe another is
good at finding horizontal lines.
At the second layer you can combine the two,
to create a filter that finds cross shapes.
Think of it this way:
Perhaps the first layer can find edges,
then the second layer takes those edges as input.
That means the second layer can find edges of edges,
meaning corners.
The third layer can find edges of edges, of edges.
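To make that layering concrete, here's a tiny sketch of two stacked filter layers: a vertical-line filter and a horizontal-line filter in layer one, and a second layer that combines their outputs to find cross shapes. The filter values are made up, and real CNNs learn the second-layer combination instead of hard-coding a multiply:

```python
import numpy as np

def convolve(image, filt):
    """Slide filt over image (no padding), summing products at each position."""
    fh, fw = filt.shape
    h, w = image.shape
    out = np.zeros((h - fh + 1, w - fw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + fh, c:c + fw] * filt)
    return out

# Layer 1: two made-up edge-ish filters.
VERT = np.array([[-1, 2, -1]] * 3)   # responds to bright vertical lines
HORZ = VERT.T                        # responds to bright horizontal lines

img = np.zeros((7, 7))
img[3, :] = 1.0                      # a horizontal line...
img[:, 3] = 1.0                      # ...plus a vertical line: a cross

v_map = convolve(img, VERT)
h_map = convolve(img, HORZ)

# Layer 2: a pixel is "crossy" where BOTH first-layer maps fire at once.
cross_map = v_map * h_map
peak = np.unravel_index(np.argmax(cross_map), cross_map.shape)
print(peak)   # the strongest response sits at the center of the cross
```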
Here, interpretation gets a little fuzzy,
because we humans don't really know how
a computer effectively uses its filters.
But I'd guess that edges of edges, of edges,
could be used to detect arrangements of corners;
in other words, simple shapes,
like equilateral triangles.
Perhaps, further layers could see
arrangements of triangles,
and further layers than that, can soon detect
whole objects. From pencils, to apples,
to chihuahuas, to humans. With more layers, and
more convolutions per layer, you can find
more and more advanced features in your
original image. Got three or four filters
that can find ridges of darkness at just
the right angles? Boom. You've got a nose detector!
Use a few other filters to find
pairs of dark ellipses that are twice as
far apart as their width, and there's
your eye detector. Add in the rest of the
body parts somewhere else, and then
combine them in a final convolution that
makes sure they're all in the right place.
And you've just got a web of convolutions that
tell you—exactly—where there are
human faces in the image.
hmm... Doesn't that look familiar...?
Okay. It doesn't tell you exactly where
the human faces are.
Since neural networks are imperfect and
make unpredictable mistakes,
they'll never achieve 100% accuracy,
but they can get into the high 90s pretty easily now.
hmm... I brushed over this topic. But usually,
interspersed throughout the webs of convolutions,
you have points where you just downscale the
image by a factor of two,
and this is called pooling.
If you downscale enough you can
slowly convert your image, of thousands of pixels, into
an image of just one pixel.
Which can either be light or dark,
or anywhere in between.
Essentially, this can be used as a marker
to look at a whole image, not just one location,
and answer, "is there a human in this picture...?
...or was this image taken indoors or outdoors...?"
If the final pixel's brightness is one, that means yes;
if it's at zero, that means no;
and anything in between means maybe.
I also bet you're asking how to deal with colored images.
Simple.
Almost all photos have three color channels:
Red, Green, and Blue.
So you can just interpret that as
three different grayscale images overlaid
on top of each other. That means you can
just set up your convolutional neural network to
have three images in the earliest layer,
instead of one.
Pretty simple, actually. Each color of RGB is called
a color channel, and the stacked outputs of
convolutions in later layers are also called channels.
More advanced CNNs can have like 40 or
60 or even 100 channels in a
single layer because that's how many
features they're simultaneously trying
to search for. So yes, this is a
convolutional neural network. It takes in
an image of 𝘯-channels as an input and
outputs a scalar or an 𝘯-dimensional
vector if you're looking for multiple
things, or just whatever you want it to output.
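So a multi-channel convolution just sums over the input channels too, and each filter you apply produces one new output channel. Here's a sketch with random made-up values, using 3 input channels (RGB) and 5 filters:

```python
import numpy as np

rng = np.random.default_rng(0)
image   = rng.random((3, 8, 8))      # an RGB image: 3 channels of 8x8 pixels
filters = rng.random((5, 3, 3, 3))   # 5 filters, each with a 3x3 slice per input channel

def convolve_multichannel(image, filters):
    """Each filter sums products over all input channels, making one output channel."""
    n_out, n_in, fh, fw = filters.shape
    _, h, w = image.shape
    out = np.zeros((n_out, h - fh + 1, w - fw + 1))
    for o in range(n_out):
        for r in range(out.shape[1]):
            for c in range(out.shape[2]):
                patch = image[:, r:r + fh, c:c + fw]   # all 3 input channels at once
                out[o, r, c] = np.sum(patch * filters[o])
    return out

out = convolve_multichannel(image, filters)
print(out.shape)   # -> (5, 6, 6): the next layer now has 5 channels
```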
That's all great, but even if
you were to program in this whole
structure perfectly, you still wouldn't
have a working convolutional neural network
because you'd have no idea what to set the filters to.
I mean, the filters determine what the network is even
searching for, so they're pretty darn important.
Maybe you could set them up
manually using your own common sense to
figure out what elements each filter
should specifically be designated for?
That would be the hardest math puzzle
of all time-- please don't do that.
Instead we want to use a ton of training data with
labels of how we would want our network
to respond to this data,
and gradient descent,
and calculus,
and math,
BUT UH-OH!!!
This video is already getting long,
so I guess it'll have to wait for part two.
Besides, you guys are getting impatient
and probably just want to see what my
new prettier friends look like.
Okay, I can introduce you to them.
At the beginning all these filters I mentioned earlier
are set with random values,
so you'll see nonsensical images,
but then it'll train to get better.
The training data is 15,000 images of celebrities
from FamousBirthdays[dot]com.
I'll explain why I chose the source in part two.
The machine learning program I'm using is
called "HyperGAN" by 255bits.
Which is Martin and Michael, and I'll also explain
why I chose this in part two.
Also, the timer at the top, shows how long my
computer has been training for,
in the hours minutes seconds format. { HH:MM:SS }
Anyway, enough talking-- Let's go!
Yep! Yep! These are my new friends all right!
So much prettier than my old, real life friends.
I am so excited to hang out with
this beautiful, new crowd.
We can watch movies, go bowling,
rip out my brain cells and
replace them with neural networks,
go shopping, eat dinner.
It'll just be a blast!
Let me answer some questions
while an irrelevant time lapse plays.
“What was that music during the training time lapse?”
It's "Skyline" by "JujuMas"
who you should really go subscribe to.
“What happens when you train it for more than 7 hours?”
Not much.
I actually trained it for a day,
and the results didn't get significantly better.
Which brings me to...
“Shouldn't you remove the non-photographs from the training data?”
Yeah I should, but it takes too much work
to sift through 15,000 images,
and if the non-photographs are a small enough proportion,
they shouldn't affect the end result much, anyway.
“What was actually your procedure for setting this up?”
Again, I'll talk about the details in part two.
Before I end this video, I want to point out that
many other—actually smart—researchers
have gotten much better results than I have.
For example, the HyperGAN GitHub page itself
shows much larger more realistic looking
generated faces-- that just-- I mean look.
Can you even tell these aren't real?
And then, I keep seeing even better,
and better results, as time goes on,
on the r/MachineLearning subreddit.
That might lead you to ask,
“Cary, why would you spend so long showing your
own—mediocre work—when other people have
literally done exactly the same thing as you,
but 10 times better?”
And that's a valid question.
I'd like to think all my projects in the past
were unique in some way, but this one really isn't.
But one, I want to make it more visible to more people,
because I feel like not that many people read the academic papers,
but a lot of people are on YouTube.
And two, this whole journey has really just been
to prove to myself that the code that's
used to generate these images can indeed
work successfully on just my computer alone.
No more relying on
what other people post to know what the results could be.
I want to see my computer reach those
results myself. Anyway, I got a juicy
NVIDIA GTX 1080 GPU for this,
so I want to make sure I can use it to its full potential.
But don't worry, more original stuff
is coming in the future.
Like this!
What's this image?! I'm so confused!
This is unlike anything I've ever seen before~
hm-
I better subscribe to "carykh" to find out what all those
interesting lines are!
I can't believe I stooped that low...
Okay, end of the video, but I want to make
a promise to all the people who've been requesting:
I am going to make a ton of tutorial videos
from here on out.
For example, showing you how to program
a neural net completely from scratch,
assuming you know nothing;
or how to replicate the results I got in my
Baroque music video...
It's all coming, just be patient, and good bye.
