We've talked about using facial points as feature descriptors, as indicators of whether there's a facial expression or not.
As you can see, you can just look at, let's say, the angle here and compare that with the angle here, and they are very useful.
They're very powerful, but they can't encode everything. For instance, if you have a little dimple in the mouth, there's no real movement of the facial points.
It's just that this mouth corner is sort of pulling inwards, so you don't get a geometric variation, but you do get changes in appearance.
It's these kinds of features that we're looking at when we're looking at appearance features. They're very good at picking up, for instance, when I frown.
And when I smile, the corner of my mouth will look very different from a neutral position.
The question, then, is how are we going to encode this appearance?
We want to minimize the variation in our features, in our descriptor, that's caused by things that are irrelevant to expression, such as lighting or identity.
And we want to maximize the variation of features that are relevant to facial expression, and in our case that's quite often edges.
So we want to encode edges in a very cheap way, and a very interesting and highly successful feature for that is called the local binary pattern.
A local binary pattern looks at nine pixels at a time, so it looks at a little block of three by three pixels, and it's particularly interested in the central pixel.
So let's say that the pixel value of our central pixel is 8, and it has eight pixels around it in its block of nine.
Let's put in some numbers.
A local binary pattern is now going to turn this set of three by three pixels into a single value, and it'll do that by first comparing every neighboring pixel with the central pixel.
This is the intensity value or the luminosity value. We're not looking at color, although you could do this in three dimensions, but normally people just look at it in grayscale values.
So we're going to compare every neighbor of this center pixel with the center. If it's greater than or equal to the center we will assign a 1, and if it's smaller than that we'll assign a 0.
So 12 is bigger than 8, so that's a 1. 15 is bigger, 1. 18 is bigger, 1. 3 is smaller than 8, as are 2 and 1, so those are 0s. 8 is equal to 8, so that's a 1 again, and 5 is smaller, so that's a 0.
And then we're going to turn these 8 bits, basically, because they can only have a 1 or a 0 value, into one byte: 1 1 1 0 0 0 1 0.
As long as we're consistent, we can turn any ordering of these numbers into one string of numbers, which we then turn into a decimal number.
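The comparison and bit-packing steps just described can be sketched in a few lines of Python. This is only an illustration: the clockwise-from-top-left neighbor order, the layout of the example block, and the name `lbp_code` are assumptions made here, since the talk only requires that the ordering be consistent.

```python
import numpy as np

# Clockwise neighbour offsets (row, col), starting at the top-left pixel.
# The start point and direction are an assumption; any fixed order works.
NEIGHBOUR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                     (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_code(block):
    """Return the LBP code of the centre pixel of a 3x3 grayscale block."""
    block = np.asarray(block)
    centre = block[1, 1]
    code = 0
    for dr, dc in NEIGHBOUR_OFFSETS:
        bit = 1 if block[1 + dr, 1 + dc] >= centre else 0
        code = (code << 1) | bit  # first neighbour ends up as the most significant bit
    return code

# Hypothetical layout: reading clockwise from the top-left gives
# 12, 15, 18, 3, 2, 1, 8, 5 around a centre value of 8.
example = [[12, 15, 18],
           [ 5,  8,  3],
           [ 8,  1,  2]]

code = lbp_code(example)
print(format(code, "08b"), code)  # 11100010 226
```

Starting from a different neighbor or going counter-clockwise would give a different decimal number for the same block, which is fine as long as the same convention is used for every pixel.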
We will be using that number to train our system. The nice thing about these local binary patterns is that they are illumination invariant.
If you change the lighting on the scene, all these pixel values will go up, but the relative order of the pixels will remain the same.
32 will still be bigger than 28, so your binary pattern will remain the same.
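A small check of that claim, as a sketch rather than a proof: the helper `lbp_image` below is a hypothetical vectorized version of the 3x3 comparison, and the random test image, the brightness offset of 20 and the shadow strength of 50 are made-up values. Adding a constant to every pixel leaves all codes unchanged, while darkening only half of the image can change codes only next to the shadow boundary.

```python
import numpy as np

OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_image(img):
    """LBP code for every interior pixel of a 2-D grayscale array."""
    img = np.asarray(img, dtype=np.int64)
    h, w = img.shape
    centre = img[1:h - 1, 1:w - 1]
    codes = np.zeros(centre.shape, dtype=np.int64)
    for dr, dc in OFFSETS:
        # Shifted view of the image: the neighbour at offset (dr, dc) of each centre.
        neighbour = img[1 + dr:h - 1 + dr, 1 + dc:w - 1 + dc]
        codes = (codes << 1) | (neighbour >= centre).astype(np.int64)
    return codes

rng = np.random.default_rng(0)
img = rng.integers(0, 200, size=(6, 6))

# Uniform brightening: every comparison keeps its outcome.
brighter = img + 20
print(np.array_equal(lbp_image(img), lbp_image(brighter)))  # True

# Cast shadow over the right half: only 3x3 windows that straddle
# the shadow boundary can produce a different code.
shadowed = img.copy()
shadowed[:, 3:] -= 50
changed = lbp_image(img) != lbp_image(shadowed)
print(changed.any(axis=0))  # per-column flag; the outer columns stay False
```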
That holds irrespective of illumination variation in general, so that could be a shadow as well, as long as we're talking about a constant change.
You do get aberrations, you do get difficult situations, at the point where you have a cast shadow, but only at the location of that cast shadow.
Because we're usually looking at 3x3 pixels, this is not a big problem. What we're now going to do is take a face.
Here's our big smiley face. You'd think I'd be better at drawing faces by now, after 10 years of working in this area, but I'm not.
We're going to divide this area into a number of blocks. At the moment I'm choosing 4x4.
This local binary pattern is centered on a single pixel and compares it with its neighbors, so basically we have to do this for every pixel in this block, and each of those will result in a decimal number.
For this block here you might get values of 2, 3, 4, 2, 8, 8, 13, 12, etc.
If there are enough pixels in that block, if the block is big enough, we will turn these values into a histogram, so basically we're looking at the statistics. How many times did 13 come up? How many times did 12 come up?
Because there are only 256 different values in this block, you actually get quite robust statistics.
In practice we use something called uniform local binary patterns, and they only have 59 different possible values rather than 256, so you get really quite robust statistics.
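The block-and-histogram step could look roughly like this. It is a sketch under assumptions: `lbp_image` and `block_histograms` are hypothetical helper names, the random 34x34 array only stands in for a cropped grayscale face, and it uses the plain 256-value codes rather than the uniform variant.

```python
import numpy as np

OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_image(img):
    """LBP code for every interior pixel of a 2-D grayscale array."""
    img = np.asarray(img, dtype=np.int64)
    h, w = img.shape
    centre = img[1:h - 1, 1:w - 1]
    codes = np.zeros(centre.shape, dtype=np.int64)
    for dr, dc in OFFSETS:
        neighbour = img[1 + dr:h - 1 + dr, 1 + dc:w - 1 + dc]
        codes = (codes << 1) | (neighbour >= centre).astype(np.int64)
    return codes

def block_histograms(codes, grid=(4, 4), n_bins=256):
    """Split the code image into grid blocks and concatenate their histograms."""
    features = []
    for row_band in np.array_split(codes, grid[0], axis=0):
        for block in np.array_split(row_band, grid[1], axis=1):
            # Count how often each code value occurs inside this block.
            features.append(np.bincount(block.ravel(), minlength=n_bins))
    return np.concatenate(features)

rng = np.random.default_rng(0)
face = rng.integers(0, 256, size=(34, 34))  # stand-in for a grayscale face crop
codes = lbp_image(face)                     # 32 x 32 codes, border pixels skipped
features = block_histograms(codes)
print(features.shape)  # (4096,)
```

With a 4x4 grid and 256 bins the descriptor has 16 x 256 entries, and each 8x8 block here contributes only 64 counts, which illustrates why fewer bins per block give more robust statistics.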
Local binary patterns also encode edges. As I said, we're interested in edge detectors, in the edges that show you the outline of the mouth or the eyelid.
As you can see here, you've got three ones, then a set of zeros, and a one and a zero.
Basically what that means is that you have a transition here from 1 to 0, a transition here from 0 to 1, from 1 to 0 again, and from 0 to 1.
Those transitions are edges, so we now very clearly indicate where you've got a transition from a light area in the face to a dark area in the face, which is exactly what an edge is.
So we've turned a possibly very high-dimensional space that was based purely on pixel intensities into a low-dimensional space that only encodes relative intensity values, and in doing so encodes edges.
So we now have an illumination-invariant descriptor of edges.
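Counting those transitions is easy to do in code. In this sketch the function name `circular_transitions` is made up, and the bits are treated as a ring, so the step from the last bit back to the first also counts, which matches the four transitions counted above.

```python
def circular_transitions(code, n_bits=8):
    """Number of 0/1 changes when walking once around the ring of neighbours."""
    bits = [(code >> i) & 1 for i in range(n_bits)]
    return sum(bits[i] != bits[(i + 1) % n_bits] for i in range(n_bits))

print(circular_transitions(0b11100010))  # 4: the example pattern
print(circular_transitions(0b11110000))  # 2: one light side, one dark side
print(circular_transitions(0b11111111))  # 0: flat patch, no edge
```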
Facial expression recognition is actually action detection. You're not really interested in the static smile.
You're interested in the fact that I went from a neutral face to smiling. So you're looking at differences, you're looking at actions, movements.
And all these descriptors, the appearance descriptors, only describe the edges in one frame. It's static.
So what you really want to do is see how these pixels change over time. One way of doing that is to extend this block to become a cube.
You would get comparisons between the center of that cube, somewhere down there, and all its neighbors, and you would have 2 to the power of 26 different possible values.
And just to say, as it goes back into 3D, that's time?
That's time, exactly. So if we're now going to look not at a single frame but at a set of frames, let's say three frames, then this is our y direction and this is our x direction.
This is our horizontal and vertical space, it's just a normal image, and then this is time, basically saying that this is the first frame, that's the second frame, and that's your third frame.
You can now look at the differences not just within one frame, between the central pixel and its neighbors, but also at the difference between this pixel and the pixel at the same location in the next frame, or the pixel in the next frame but a little bit up.
So you get a cube of pixels around this central pixel: 9 in front, 9 in back, and 8 surrounding it in the current frame.
That's a total of 26 different neighbors.
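As a quick sanity check on that count, this sketch enumerates every offset in a 3x3x3 cube around the center. The (time, row, column) axis order is an assumption made here for illustration.

```python
from itertools import product

# All (dt, dy, dx) offsets in a 3x3x3 cube, excluding the centre itself.
cube_neighbours = [off for off in product((-1, 0, 1), repeat=3) if off != (0, 0, 0)]

previous_frame = [off for off in cube_neighbours if off[0] == -1]
current_frame = [off for off in cube_neighbours if off[0] == 0]
next_frame = [off for off in cube_neighbours if off[0] == 1]

print(len(previous_frame), len(current_frame), len(next_frame))  # 9 8 9
print(len(cube_neighbours))       # 26
print(2 ** len(cube_neighbours))  # 67108864 possible codes with one bit per neighbour
```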
So you get 2 to the power of 26 different possible values, and that's a lot. Instead you can do a little trick.
You can say, I'm still interested in the changes over time, but I might not be interested in every possible change. I just want to look at three orthogonal planes.
The first orthogonal plane is of course just the xy, your normal image. Then you take a slice through x and t, so that's a horizontal slice, and you take a slice through the vertical and time, so that's a vertical slice.
That one is orthogonal to the x and t, and of course it's orthogonal to the normal xy plane, so that's why it's called three orthogonal planes.
Each of them encodes either edges in space, in just your normal 2D image, or an edge in x and time, or an edge in y and time.
If you do this, you get 3 times 2 to the power of 8 possible values, which is a lot smaller than 2 to the power of 26, and you still encode the movements, the actions of edges over time.
And if you do this, you will get significant performance increases.
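A minimal sketch of the three-plane idea for a single pixel, assuming a video stored as a NumPy array indexed as (time, row, column). The helper names `plane_code` and `lbp_top_codes`, the clockwise bit order, and the toy three-frame clip are illustrative choices, not something specified in the talk.

```python
import numpy as np

RING = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def plane_code(plane):
    """8-bit LBP code of the centre of a 3x3 slice."""
    centre = plane[1, 1]
    code = 0
    for da, db in RING:
        code = (code << 1) | int(plane[1 + da, 1 + db] >= centre)
    return code

def lbp_top_codes(volume, t, y, x):
    """Codes in the xy, xt and yt planes for the pixel at (t, y, x)."""
    xy = volume[t, y - 1:y + 2, x - 1:x + 2]  # the normal image plane
    xt = volume[t - 1:t + 2, y, x - 1:x + 2]  # horizontal slice through time
    yt = volume[t - 1:t + 2, y - 1:y + 2, x]  # vertical slice through time
    return plane_code(xy), plane_code(xt), plane_code(yt)

# Toy clip: the same 3x3 frame repeated three times (nothing moves).
frame = np.array([[12, 15, 18],
                  [ 5,  8,  3],
                  [ 8,  1,  2]])
clip = np.stack([frame, frame, frame])

print(lbp_top_codes(clip, t=1, y=1, x=1))
print(3 * 2 ** 8, "values instead of", 2 ** 26)  # 768 values instead of 67108864
```

In a full descriptor each of the three codes would feed its own histogram, and the three histograms would be concatenated, giving the 3 times 256 values mentioned above.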
With uniform local binary patterns, all the legal patterns get their own code, and all the illegal patterns get meshed together into the 59th.
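The talk does not spell out what makes a pattern legal. In the LBP literature a pattern is usually called uniform when it has at most two transitions around the ring of neighbors, and the sketch below assumes that definition. The function names and the particular label numbering are made up for illustration.

```python
def circular_transitions(code, n_bits=8):
    """Number of 0/1 changes when walking once around the ring of neighbours."""
    bits = [(code >> i) & 1 for i in range(n_bits)]
    return sum(bits[i] != bits[(i + 1) % n_bits] for i in range(n_bits))

def uniform_lookup(n_bits=8):
    """Map every raw code to a label: one per uniform pattern, one shared by the rest."""
    uniform_codes = [c for c in range(2 ** n_bits)
                     if circular_transitions(c, n_bits) <= 2]
    table = {code: label for label, code in enumerate(uniform_codes)}
    catch_all = len(uniform_codes)  # the shared label for all non-uniform patterns
    for code in range(2 ** n_bits):
        table.setdefault(code, catch_all)
    return table

table = uniform_lookup()
print(len(set(table.values())))  # 59 labels in total
print(table[0b11110000])         # a uniform pattern keeps its own label
print(table[0b11100010])         # 58: four transitions, so it lands in the shared label
```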
