We have all come across, or maybe even used, the Snapchat filters that can overlay funny graphics on your face.
Apple has also recently introduced its so-called Animoji, which maps your expressions onto a 3D animated character that looks pretty realistic.
All of these are fun to play with, but you might have wondered how this stuff actually works.
So basically, it's a type of machine learning model called a convolutional neural network which, when given enough labeled data to train on, learns facial features like the eyes, nose, and mouth, and then returns the positions of the key points.
This technology is a valuable tool in areas like biometric facial recognition, expression analysis, and even medical diagnosis.
We'll be exploring this today and building a model from scratch to detect facial key points in real time. And by the end of this video, we'll be able to build something like this, so stay focused.
So first we need some data to train our model. Our dataset should contain face images along with labels giving the x-y coordinates of the facial key points.
After looking around the web for a freely available dataset, I found one on Kaggle containing about 7,000 (7,049) training images labeled with 15 facial key points. But after downloading and examining the data with a simple Python script, I noticed that only about 2,000 images actually had all 15 key points; the rest had just 4.
So what we can do is drop all the images with only 4 key points and keep just the ones with 15. That leaves us with a training set of 2,140 images, which might be a bit low, but we can work with it for now.
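As a rough sketch, assuming the Kaggle training.csv layout (30 coordinate columns plus an 'Image' column of space-separated pixel values), that filtering step could look something like this:

import numpy as np
import pandas as pd

# Load the Kaggle training CSV: 30 coordinate columns plus an 'Image'
# column holding each 96x96 image as space-separated pixel values.
data = pd.read_csv('training.csv')

# Keep only the rows where all 15 key points (30 coordinates) are present.
data = data.dropna()

# Parse the pixel strings into arrays and scale them to [0, 1].
X = np.stack(data['Image'].apply(
    lambda px: np.array(px.split(), dtype='float32')))
X = X.reshape(-1, 96, 96, 1) / 255.0

# The remaining 30 columns are the x-y coordinates of the key points.
y = data.drop(columns=['Image']).values.astype('float32')

print(X.shape, y.shape)  # roughly (2140, 96, 96, 1) and (2140, 30)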
We'll be coding the model in Python using the Keras library.
So here we have a convnet architecture containing 5 major layers, 1 fully connected layer, and an output layer. Each of the major layers contains a convolutional layer followed by a 2-by-2 pooling layer.
Now the first convolutional layer has 16 filters, the second 32, the third 64, the fourth 128, and the fifth 256. All of these filters are 3 by 3, and we'll be using the ReLU activation function in these layers.
After that, the flattened values from the fifth major layer are fed into the fully connected layer with 512 nodes, and then finally into an output layer with 30 nodes, one x and one y coordinate for each of the 15 key points. And that's our network architecture.
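Here's a minimal Keras sketch of that architecture; details like the default 'valid' padding and the ReLU on the dense layer are my assumptions, since they aren't spelled out above:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# Five conv + 2x2 max-pool blocks, then a 512-node dense layer and a
# 30-node linear output (one x and one y for each of the 15 key points).
model = Sequential()
model.add(Conv2D(16, (3, 3), activation='relu', input_shape=(96, 96, 1)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(32, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(128, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(256, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(512, activation='relu'))
model.add(Dense(30))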
And in case you're wondering how I chose these specific layer settings: it was just a trial-and-error process, testing the dataset with different settings and picking the one that gave the lowest loss and fit the data well.
Now that the model architecture is ready, let's train our model. Here we'll set the number of epochs to 60 and the batch size to 64, and use the Adam optimizer with a mean squared error loss function. Training might take 3 to 15 minutes depending on whether you're using a GPU or a CPU.
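A sketch of that training step, continuing from the model and data above; the 20% validation split is my addition, used for the loss curves discussed in a moment:

from keras.optimizers import Adam

# Adam optimizer with mean squared error loss, as described above.
# Accuracy is a loose metric for a regression task, but it's the
# number quoted below, so we track it here too.
model.compile(optimizer=Adam(), loss='mean_squared_error',
              metrics=['accuracy'])

# 60 epochs, batch size 64; hold out 20% of the data for validation.
history = model.fit(X, y, epochs=60, batch_size=64, validation_split=0.2)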
As the model trained, we saw the accuracy gradually climb from about 50% to around 84%, which seems reasonable. I tried modifying the network by adding a few extra layers and tuning some hyperparameters, but about 84% was the highest accuracy I could achieve. I'm sure training the model on a bit more data might yield a higher accuracy. But for now, this looks pretty good.
So if we plot the loss graph of the training, there's a steep drop at the beginning, and then both the training and validation loss flatten out and pretty much converge. This suggests that our model didn't overfit the training data but actually learned the facial features well.
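With the history object returned by fit() above, a plot like that takes just a few lines of matplotlib:

import matplotlib.pyplot as plt

# Training vs. validation loss per epoch.
plt.plot(history.history['loss'], label='training loss')
plt.plot(history.history['val_loss'], label='validation loss')
plt.xlabel('epoch')
plt.ylabel('mean squared error')
plt.legend()
plt.show()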
So now that our model is fully trained and ready, we can test it with some random images from the web. But before that, we need to make sure only grayscale images of size 96 by 96 are sent to the model, not a whole image that might contain many faces at different locations.
To achieve that, we need a face detection system that can give us the location of each face in the image. For that, we could either build a separate neural network model or use the Haar cascade face detection algorithm, which is lightweight and robust enough for our case.
So after writing a few lines of code with OpenCV, I got it working to return the location coordinates of each face; we can then crop that part of the image to get just the faces.
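A sketch of that detection-and-crop step with OpenCV; the test image name is hypothetical, and the cascade file path assumes the copy bundled with the opencv-python package:

import cv2

# Load OpenCV's pre-trained frontal-face Haar cascade.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

image = cv2.imread('test.jpg')  # hypothetical test image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Each detection is an (x, y, w, h) bounding box.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)

for (x, y, w, h) in faces:
    # Crop the face, resize to 96x96, and scale to [0, 1] as in training.
    face = cv2.resize(gray[y:y + h, x:x + w], (96, 96)) / 255.0
    keypoints = model.predict(face.reshape(1, 96, 96, 1))[0]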
Now the straightforward part is to feed each of these face images into our trained model to get the detected facial key points.
To make it work on a video file or a real-time video feed, we can use OpenCV to read the video, then run a while loop where each frame is sent to our model, which returns the coordinates of the detected facial key points.
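Here's a sketch of that loop, reusing the face_cascade and model from before; the file name and the circle-drawing step are my additions:

import cv2

cap = cv2.VideoCapture('clip.mp4')  # or 0 for a live webcam feed

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in face_cascade.detectMultiScale(gray, 1.3, 5):
        face = cv2.resize(gray[y:y + h, x:x + w], (96, 96)) / 255.0
        pts = model.predict(face.reshape(1, 96, 96, 1))[0]
        # Predictions are in the 96x96 crop's coordinates; map them back
        # to the full frame before drawing.
        for px, py in pts.reshape(-1, 2):
            cv2.circle(frame, (int(x + px * w / 96), int(y + py * h / 96)),
                       2, (0, 255, 0), -1)
    cv2.imshow('facial key points', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()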
So here are some of the images that I have tested it with.
Also some video clips.
And that's it for this video. If you want to download the code for this project, I have linked it below in the description; it's a Jupyter notebook file.
Also, please make sure to like the video and subscribe, since I'll be uploading more content related to machine learning and AI, so stay tuned.
And if you have any doubts or suggestions, leave them below in the comments. That's all for now, see you next time!
