>> Does anyone want
to have a try?
>> [inaudible]
>> Yeah.
>> Is there any tracking
in this or is it just?
>> No. Just the discriminative
part, it's frame based.
>> It sounds good.
>> It runs at about 25 to 33
frames per second.
Yeah. I think we can.
>> I'd like to introduce you.
>> Yeah.
>> Yes. So, hello. We're very
pleased to have Chie here,
who is studying for a PhD
at Imperial with TK Kim,
and was previously at
Tsinghua in Beijing.
As you can see,
she is an expert on
hand tracking and has also
been very involved in
the hand benchmarking projects
at recent CVPRs and
ECCVs and so on
(sorry, ICCVs),
and she's going to tell
us about hands. Welcome.
>> Thank you.
Yeah. Okay. I'm very
glad to have this opportunity
to present my PhD work
on 3D Hand Pose Estimation
using Convolutional
Neural Networks.
So, recently,
3D Hand Pose Estimation
has gained a lot of
interest with the availability of
commercial depth cameras.
We can see many companies have
incorporated, or are planning to
incorporate, hand interaction
into their products.
3D Hand Pose Estimation can
be formulated as finding
a configuration in the pose space
given the depth image.
The pose space can
be represented by
3D joint locations
or by the joint angles of the pose.
So, assume we have
a dataset that consists of pairs:
input images and
their labels.
What can we do if we want
to make a prediction
for a new, unseen image,
that is, if we want to estimate
the joint locations
for new images?
Yes. We can simply learn
a mapping function f from x to y.
For this function we can
choose a CNN, a random forest,
or another supervised
learning method,
but we will have this question:
will this function
learn a good mapping?
If not, what are the challenges?
So, these are the challenges.
First, hands can appear in
any viewpoint, with
many complex articulations.
In some viewpoints there
are self-occlusions,
there are many self-similar parts,
and hands also come in
different shapes.
Finally, acquiring a labeled
dataset is itself a problem.
So, there are typically
three approaches to
solve this problem.
The first one is
the learning-based method.
It is what we have just discussed:
learn a mapping from x to y.
It is efficient, there is
no need to do tracking,
and there is no need
for initialization.
It is like the little demo
I showed just now,
but you still need to
learn the function.
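(As a rough illustration of this learning-based formulation, here is a minimal sketch in PyTorch; the architecture and sizes are my own assumptions, not the network used in the talk. The mapping f takes a depth image x and regresses the joint vector y with a standard supervised loss.)

```python
import torch.nn as nn

J = 21  # number of hand joints, as in the talk

# Hypothetical regressor f: single-channel depth image -> 3*J joint coordinates.
f = nn.Sequential(
    nn.Conv2d(1, 16, 5, stride=2, padding=2), nn.ReLU(),
    nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.Flatten(),
    nn.LazyLinear(1024), nn.ReLU(),
    nn.Linear(1024, 3 * J),   # (x, y, z) for each joint
)
loss_fn = nn.MSELoss()        # supervised regression against the labeled joints
```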
The second method is
the generative method.
We assume we have a hand model,
and we search its
pose parameter space to
find the configuration that best
aligns the hand model
to the input image.
If we use a very good
hand mesh model, for example,
this method can achieve
very accurate results,
but as the hand pose lies in
a very high-dimensional space,
it is very hard to converge
without a good initialization.
So, the third method
is a hybrid method
that combines the above two.
The discriminative method can
provide an initialization for
the generative method, as in
Sharp et al. and Taylor et al.,
and the prediction of
the learning-based method
can also act as a part
of the energy for
the generative method.
So, my PhD work tackles
these challenges with
learning-based methods.
The first works deal
with viewpoints, articulation,
and self-similar parts, and the
final work focuses on occlusion.
At the same time, we collected
a large dataset by
designing an automatic
labeling system.
So, my first work is motivated
by two observations.
The first one is that many
hand pose images differ only
because the viewpoint changes.
The second observation is that,
although the images demonstrate
complex articulations,
many images share self-similar
parts, like the squared parts.
So, we can use a part-based
method to help
our learned model
generalize to unseen images.
However, due to
the many self-similar parts,
a model trained with
a part-based method can
hardly discriminate
different parts.
I can challenge you to
identify which part a patch
belongs to; it is quite hard.
But if we shift our attention
along the hand, that is,
estimate the hand pose
by following the hand structure,
then if we know, let's say,
the middle finger,
by following
the hand structure,
we can identify
which part each patch belongs to.
So, for each part,
we can learn offsets from
the part center to the following
joint locations.
As long as we can find
similar parts in our dataset,
we can estimate the parts
for new, unseen images.
Then, by concatenating
all the patches,
we can get the whole pose.
Also, as we estimate
the hand sequentially
by following this hierarchy,
we are able to reduce
the viewpoint variation
at the same time.
We assume the palm
is rigid, so we can
calculate the global rotation
by estimating the palm joints.
Then, using these estimated
joint locations,
we can align all viewpoints
to one direction.
So, this reduces
the viewpoint variation.
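(A hedged sketch of this viewpoint-alignment idea: the canonical palm template and the Kabsch alignment below are my own illustration, not necessarily the exact procedure used in the talk.)

```python
import numpy as np

def palm_rotation(palm_est, palm_canonical):
    """Global hand rotation from the estimated palm joints (both arrays Nx3),
    assuming the palm is rigid: rigidly align the estimate to a canonical
    palm template with the Kabsch algorithm."""
    A = palm_est - palm_est.mean(axis=0)          # center the estimated palm joints
    B = palm_canonical - palm_canonical.mean(axis=0)
    U, _, Vt = np.linalg.svd(A.T @ B)             # cross-covariance H = A^T B
    d = np.sign(np.linalg.det(Vt.T @ U.T))        # avoid a reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T       # R maps estimated palm onto the template
    return R

# Applying R to all joints rotates every frame toward one canonical direction,
# which is what reduces the viewpoint variation for the later layers.
```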
So, we are not the first to
use a hierarchy or the hand structure.
These two works decompose
the pose into joint locations by
following the hierarchy.
The difference of our work
from these works is
that we borrow
spatial attention from image
recognition to transform
the input and the output at
the same time, to reduce
the variation of both the input
and output spaces.
By attending to local parts
and aligning viewpoints,
the variation of the
input space is reduced.
You can see the whole images
demonstrate many variations,
but for the local hand parts,
the variation is much smaller.
At the same time, by
organizing the parts following
the hand structure, we avoid
having to discriminate
the different hand parts,
and we also make use of
the self-similar parts:
since we only want to know the offsets,
as long as we find
some similar parts in the dataset,
we can make an estimation
for the new patches.
The output space is also
largely reduced.
For example, the black region
is the pose space for
the middle fingertip.
If we use a part-based method
and learn offsets,
this space becomes
the brown one, and if we
further align the
viewpoint to one direction,
the space for the fingertip
is confined to the red arc.
So, to attend to
these local patches
along the hierarchy,
we apply a spatial
transformation to
the input image and to the
output joint locations.
The parameters of
the transformation
are obtained from the estimation
results of the previous layers.
For example, if we want to
estimate the yellow joint,
we focus on a patch
centered on the estimated
red joint location.
The tx and ty in
the transformation are
the location of that joint,
theta is the rotation
of the whole pose,
and s specifies the crop
size of the patch.
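(A hedged sketch of what this spatial transformation does to 2D coordinates; the function names and exact order of operations are my own illustration. In the talk, the same parameters also drive the image crop, so input and output are transformed consistently.)

```python
import numpy as np

def to_patch(points_uv, t, theta, s):
    """Map 2D points from the image frame into the local, aligned patch frame.
    t = (tx, ty): the previously estimated joint location (patch center),
    theta: the estimated in-plane rotation of the pose, s: the crop size."""
    c, si = np.cos(-theta), np.sin(-theta)     # undo the estimated rotation
    R = np.array([[c, -si], [si, c]])
    return (points_uv - np.asarray(t)) @ R.T / s

def from_patch(points_patch, t, theta, s):
    """Inverse transform: map joints predicted in the patch back to the image,
    so the next hierarchy layer can be centered on them."""
    c, si = np.cos(theta), np.sin(theta)
    R = np.array([[c, -si], [si, c]])
    return (points_patch * s) @ R.T + np.asarray(t)
```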
So, the spatial
attention module sits
between two successive
layers in our hierarchy.
It transforms the input
and the estimated joint locations
to a local patch and aligns them.
So, if we put all the
hierarchy layers together,
we can get the whole pose.
However, as in this slide,
if one of the estimates is wrong,
the following estimations
will be wrong.
So, along the hierarchy layers
there exists error
accumulation in our method.
We propose two solutions:
the first one is to use
cascaded stages to
refine the estimation;
the second one is to
use some criterion to
evaluate the estimates and
remove bad hypotheses.
So, for the second solution,
we extend the energy of
the work we mentioned
by introducing
the palm structure
and the bone lengths.
We generate multiple samples
around the estimation
of, let's say, layer 1,
and then use PSO to search
for the minimum of the energy.
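(A very rough sketch of this refinement step, as a plain particle swarm over a flattened joint vector. The energy below only shows the bone-length prior mentioned in the talk, with a placeholder data term, so every name here is an assumption rather than the speaker's exact formulation.)

```python
import numpy as np

def pso_refine(energy, init, n_particles=32, n_iters=30, spread=5.0,
               w=0.5, c1=1.5, c2=1.5, seed=0):
    """Refine a CNN joint estimate `init` (flattened, in mm) by minimizing
    `energy` with a basic particle swarm, seeding particles around `init`."""
    rng = np.random.default_rng(seed)
    x = init + spread * rng.standard_normal((n_particles, init.size))
    v = np.zeros_like(x)
    pbest, pbest_e = x.copy(), np.array([energy(p) for p in x])
    gbest = pbest[pbest_e.argmin()].copy()
    for _ in range(n_iters):
        r1, r2 = rng.random((2, n_particles, init.size))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = x + v
        e = np.array([energy(p) for p in x])
        improved = e < pbest_e
        pbest[improved], pbest_e[improved] = x[improved], e[improved]
        gbest = pbest[pbest_e.argmin()].copy()
    return gbest

def make_energy(data_term, bones, bone_lengths, lam=0.1):
    """Energy = data term (e.g. depth alignment, supplied by the caller)
    plus a penalty on deviations from the known bone lengths."""
    bones = np.asarray(bones)
    def energy(pose_vec):
        joints = pose_vec.reshape(-1, 3)
        lengths = np.linalg.norm(joints[bones[:, 0]] - joints[bones[:, 1]], axis=1)
        return data_term(joints) + lam * np.sum((lengths - bone_lengths) ** 2)
    return energy
```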
So, the whole pipeline
consists of four layers.
Each layer consists
of cascaded stages and
a partial PSO to estimate
the joint locations
in that layer.
A spatial attention module
is added between two
successive CNNs to focus on
a local patch and align
the input and output.
We can see that after refinement,
the error accumulation
is largely reduced.
So, to show the efficacy of
each component we
use in our pipeline,
we provide multiple baselines.
The first one is
a holistic CNN regressing
the 21 joints together.
In the second one,
we have a separate
CNN to estimate
the global rotation,
then rotate all images
to one direction,
and then estimate the 21 joints.
For the third baseline, we first
estimate all joints together
and then apply
spatial attention
on some local patches
to refine the estimation.
The fourth and the fifth are
our proposed method,
without refinement of
the error accumulation and
with one round of refinement.
>> So, are the [inaudible] sizes
of the CNNs and the baselines comparable?
>> In our model, we
have multiple CNNs,
so we set the size of the baseline
CNN so that its memory size
is the same as our proposed method.
So the size, or
the capacity, of each
is roughly the same.
So, we can show that each
component we add
improves the performance
on the dataset.
And here, we show the viewpoint
distribution, using
the rotations estimated
from the palm joints
at the different
cascaded stages.
So, the blue line is
the original distribution,
and after the refinements,
the rotation variation
becomes largely reduced.
So, then we compared
our method to
many [inaudible] works.
HSO is Tang et al.'s work,
which is quite similar to our method;
our work can be treated as
an extension of that work.
But we can see that on this dataset
and on our dataset,
we improve on it by a large margin.
So, the next work
is about datasets.
The [inaudible] method can
help generalization when
there is not enough data,
but there are still
many cases where we cannot
even find similar parts
in the dataset.
So, we want to collect
a larger dataset.
Sure, we can try using
synthesized images to
get larger datasets,
like MSRC's synthetic dataset,
but the synthesized images appear
quite different
from real images,
and it is also not easy to
design very natural poses.
So, before BigHand,
the largest real dataset
was HandNet,
but it only labels six joints.
So, these two factors
motivated us to design
an automatic labeling system with
full hand pose annotation.
So here, I will not
go into the details,
but I will show the space coverage
of the dataset.
For the BigHand dataset, this one is
the viewpoint coverage, and
[inaudible] the articulation.
We can see BigHand
demonstrates a broader
and more even coverage
compared with the widely used
existing datasets.
So, using this dataset,
we held the 2017 hands challenge,
and the top three achieved
errors of about 10 millimeters.
And the demo I showed just
now is similar;
the top three are at about
13 millimeters for unseen objects.
So, we have already
got errors as low
as 10 millimeters,
and all of these methods are based
on the discriminative method.
So, we have this question:
does learning a mapping function
solve all the challenges
we mentioned at the start?
When we collected the dataset,
we considered the design carefully;
we considered the viewpoints
and articulations.
But we haven't
considered the occlusions.
So, can the mapping
function work well under
severe occlusions?
So, in this work,
I want to model hand
pose under occlusion.
So, before we start, let's
figure out a little
problem first.
When we collected
the dataset for BigHand,
we had this observation:
for many images with occlusion,
we have multiple plausible
locations for the occluded joints.
You can see, when
the camera is over there,
we can get
many different articulations.
So, the blue skeletons are
the ground truth for visible
joints, and the red
and the yellow are
for the occluded ones.
So, our problem becomes:
even when using
a learning-based method,
we are mapping one image
to multiple joint locations.
So what will happen if we use
a CNN trained with
mean squared error?
So, we first look at an example
image with occlusion.
The red skeleton is
the prediction of a CNN trained
with mean squared error.
For visible joints, the
estimation is more or less good,
but for occluded joints,
the prediction is not
close to any ground truth.
So, to further
clarify our problem,
assume our dataset for training
the CNN contains two exactly
identical images, as in (a),
and the ground truths are the yellow
ones in (d). After convergence,
the estimation we are given,
the red one, is the average of
these two labels.
So apparently, the average is
a poor estimation for
an articulated object.
So, what should we do?
For the visible joints,
we have only one location;
for occluded ones, we
have multiple locations.
So, we can think of modeling
these two cases by different
models or different losses.
So, we propose to solve
this problem with
a two-level hierarchy.
At the top level, we
have a binary variable
to represent whether a joint
is visible or not.
If it is visible, we have
a unimodal distribution
to model it,
and for an occluded one,
we have a multimodal distribution.
Then, depending on the
visibility in the top level,
we can switch between
these two cases.
So, the binary variable
representing whether a
joint is visible or not
follows a Bernoulli distribution.
When a joint is visible,
its location is actually deterministic,
but if we consider
the label noise,
we can see the joint as if it were
drawn from a single
Gaussian distribution,
so we model it by
a single Gaussian distribution.
When v is zero,
the joint is occluded.
So, we use a mixture of Gaussians
with J components
to model these
multimodal labels.
So, with all
the components defined,
the distribution of the joint
location y conditioned on
the visibility is this,
and the joint
distribution of y
and the visibility
is like this.
So, the two-level hierarchy is
shown in the conditional
distribution.
First, a sample v is drawn
from the Bernoulli distribution,
and then, depending on v,
the joint location y is
drawn from the single
Gaussian distribution
or the mixture of Gaussians.
So, our proposed method
can switch between these two cases,
and it provides a full
description of hand
poses under occlusion.
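(A hedged reconstruction of that two-level model for a single joint; the symbols w, mu, sigma, and pi are my own notation for the per-joint network outputs, with J mixture components.)

```latex
p(v \mid x) = w^{\,v}\,(1-w)^{1-v}, \qquad
p(y \mid v=1, x) = \mathcal{N}\!\big(y;\, \mu_0,\, \sigma_0^2 I\big), \qquad
p(y \mid v=0, x) = \sum_{j=1}^{J} \pi_j\, \mathcal{N}\!\big(y;\, \mu_j,\, \sigma_j^2 I\big),

p(y, v \mid x) = \Big[w\,\mathcal{N}\big(y;\mu_0,\sigma_0^2 I\big)\Big]^{v}\,
                 \Big[(1-w)\sum_{j=1}^{J}\pi_j\,\mathcal{N}\big(y;\mu_j,\sigma_j^2 I\big)\Big]^{1-v}.
```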
So, now we have
defined our model,
and the next problem is:
how do we set the parameters?
Note that the hierarchical
mixture density,
and also the joint distribution,
are conditioned on the input image x.
So, these parameters
actually depend on x,
or they can be represented
by a function of x.
As the joint distribution
is differentiable,
we choose a CNN to
learn these parameters.
So, given a dataset of images
with their joint locations
and visibilities,
the likelihood of
the whole dataset is
the product over all
the individual joints
and all images.
Here, we assume all the joints
are independent of each other.
So, our goal is to learn
a neural network whose
parameters maximize
the likelihood.
We use the negative
log-likelihood
as the loss function.
So, this loss function
consists of three terms.
The first term is
about visibility,
the second term
is for visible joints,
and the last one is
for joints under occlusion.
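(A hedged reconstruction of that negative log-likelihood, grouped into the three terms just described; the indices n over images and k over joints are my own notation, and all parameters are outputs of the CNN for image x_n.)

```latex
\mathcal{L} = -\sum_{n}\sum_{k} \Big[
    \underbrace{v_{nk}\log w_{nk} + (1-v_{nk})\log\big(1-w_{nk}\big)}_{\text{visibility}}
  + \underbrace{v_{nk}\,\log \mathcal{N}\!\big(y_{nk};\,\mu_{0,nk},\,\sigma_{0,nk}^{2}I\big)}_{\text{visible joints}}
  + \underbrace{(1-v_{nk})\,\log \sum_{j=1}^{J}\pi_{j,nk}\,\mathcal{N}\!\big(y_{nk};\,\mu_{j,nk},\,\sigma_{j,nk}^{2}I\big)}_{\text{occluded joints}}
\Big]
```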
So, during testing, when an image x
is fed into the network,
the prediction for one joint
is diverted to
different branches depending
on the predicted visibility w.
If w is larger than 0.5,
the prediction, or the sampling,
for the visible joint
comes from the first branch,
the single Gaussian.
Otherwise, it comes from
the second branch.
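(A minimal sketch of this test-time branching; the output layout of the network is an assumption, and in practice one could also draw samples rather than take the mean.)

```python
import torch

def predict_joint(outputs):
    """Pick a single 3D prediction for one joint from HMDN-style outputs.
    `outputs` is assumed to hold: w (visibility), mu0 (visible-branch mean),
    pi (mixture weights), mu (the J occluded-branch means)."""
    w, mu0, pi, mu = outputs
    if w.item() > 0.5:            # predicted visible: use the single Gaussian
        return mu0
    k = torch.argmax(pi)          # predicted occluded: take the dominant component
    return mu[k]                  # (or sample from the mixture for multiple hypotheses)
```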
But the problem is that when
the prediction for the
visibility is wrong,
the sample or the prediction
for the joint will be wrong,
because we go into
the wrong branch.
So, to alleviate this problem,
instead of using
the binary visibility label v
to compute the likelihood,
or the loss function,
during training,
we use samples drawn from
the estimated distribution.
When the number of samples for
the visibility is large enough,
the average of
these samples becomes w,
so we replace v by
w during training.
We call this the soft
version of our proposed method.
So, to demonstrate the
superiority of our work,
we construct two variants.
The first one is
a method that models
all the joint locations
by a single Gaussian (SGN).
The second one models
all the joints by
a mixture of Gaussians (MDN).
So, in this figure,
we draw 100 samples for
each fingertip from the distributions
of all these methods.
So, for the visible
joints, in this line,
SGN and HMDN actually
produce compact
samples around
the ground truth,
while for MDN,
we can see it actually has
a broader range
compared to these.
For the occluded joints,
the samples produced by SGN
span quite a broad range,
and they overlap
with other joints.
The samples produced by HMDN,
our proposed method,
scatter within
the movement range of the joints.
So, we can see
HMDN combines
the advantages of SGN and MDN,
and is able to produce
compact samples
for the visible joints
and interpretable samples
for joints under occlusion.
So, to further demonstrate
the distribution
predicted by our method,
we represent the distribution
by spheres.
The spheres are the components
of the mixture of Gaussians:
the center is the mean,
the radius is
the standard deviation,
and the transparency
is in proportion to
the mixture weight of
each component.
So, we can see HMDN actually
combines the strengths
of SGN and MDN.
So, to compare these variants,
we draw one sample and then
compare the sample
to the pose label.
The average errors for
the visible joints and
the occluded joints
are listed here.
We can see that, for different
numbers of Gaussian components,
HMDN consistently
improves over SGN and MDN.
Our motivation is
actually modeling
the distribution under occlusion.
We evaluate these distributions
by drawing samples from
the predicted distribution.
We compare the set of ground-truth
labels with the set
of samples we draw,
calculate their minimum distances,
and see how well they
align with each other.
If they are well aligned,
the distances should be small.
So, we can see
the method we propose
improves a lot over SGN
for the fully occluded ones,
and compared to MDN,
it achieves
a similar accuracy.
Yeah. So, previously, we proposed
to mitigate the wrong-branch
problem during testing
by sampling from
the visibility distribution
during training.
So, in this table, we show
the soft version consistently
leads to better results
than the hard version.
When we compare
with prior work,
we choose datasets
that contain
a considerable number
of occluded joints,
both in testing and training.
We compared with three methods;
the first figure shows our
comparison with these methods
using the commonly
used measurement,
the proportion of joints
under a certain error threshold.
The first one is the ICVL work
and then the second one
is Tang's work.
We compare with them in the last
two figures by sampling,
because these two methods can
provide multiple candidate poses.
For the ICVL work,
by jittering, [inaudible] the estimation
can be treated as a unimodal Gaussian model.
In Tang's work, they have
a GMM fitted at the leaf nodes,
so it can be treated as
a mixture of Gaussians.
So when we compare
with these two methods,
we draw multiple
samples and choose
the one with the smallest
distance to the ground truth
among the samples.
We can see our method
actually achieves a lower
error when we
draw more samples,
and the variance of
our method is smaller
than these two methods.
So, that is all of my
work. Thank you.
>> Thank you very much.
Time for questions.
>> Yeah. So in one of
the tables, you show that
your hierarchical HMDN
method actually improved also
on the visible joints, not
only on the occluded joints.
Any intuition why that
might be the case? So, yeah.
>> So, you mean-
>> You also have improvement
on visible joints, not
only on occluded ones.
Yeah, yeah, yeah.
For the visible joints,
you mean why it should improve
on the visible joints?
Yeah, it actually should,
because also, as a reference,
Bishop's 1995 work showed
that when we model
multimodal problems
with such a distribution,
it helps the learning for
the unimodal cases as well.
So I think this is
because during learning,
we have errors
from these two parts,
and the error from the multimodal part
is learned, is converged, more easily.
So it helps the network to
allocate its capacity to
the visible, or single Gaussian...
oh, no, no, no,
the unimodal part.
So if we treat those cases
by using a single model
or using mean
squared error, it actually adds
a burden, or maybe misleads
the network to converge
to a bad minimum.
>> Is that something
you are trying to extend, because
of the occlusion,
to the situation with two hands and
[inaudible]? Or have you
not tried that yet?
Would you expect it to work
in that situation?
>> Yeah, I think so.
I haven't tried it
with two hands,
but one part of
the difficulty is
the detection part: how to
discriminate the two hands.
But once we have
that information,
like the cropped hands,
I think it will work.
But I'm not sure, because it
is quite complicated even
for a single hand,
especially when
you have some occlusions.
I think Taylor's, like,
their recent work,
they deal with
these slight interactions,
but also two hands.
But this work itself
can handle two hands if there
is no interaction.
>> How was the training
data labeled?
I mean, for an occluded joint,
did you supply multiple labels,
or use, like, one per example?
>> Yeah, actually,
in training we didn't
do anything manual,
like supplying labels, because if
we look at the pose part,
starting from the labeled parts:
if we look at each label,
it actually has
a corresponding input image,
so each label
has its corresponding input images,
and therefore we have
different labels.
So during training, we
didn't group any of them;
we just fed
them into the training.
We don't need to provide them,
because they actually exist in
the dataset. But for testing-
>> Yeah, that is how the dataset
was collected.
>> Yeah, yeah, so
during the collection,
I actually performed these poses
as many times as possible.
>> Right. Yeah.
>> Yeah. So for testing,
we need to group
some poses together,
similar ones, like in a sequence:
if they are very
close to each other for
the visible joints,
that means they are
multiple ground truths.
Yeah, yeah, thank you.
>> How was it annotated?
So, 2.2 million depth
images were recorded;
I'm just not familiar with how
the ground truth was attached.
>> For the ground truth,
we have a trakSTAR system:
sensors are attached to the hand,
and we have six sensors.
>> Okay.
>> Yeah, one is attached to
the wrist and the other ones
to the fingertips.
Because the sensors provide
both locations and rotations,
global rotations, if
we assume the palm is rigid,
we actually can
calculate the other joints,
their locations,
and the rotations.
>> Right. So in terms of the data
recorded, it's actually
the same as the data
you compared before.
>> Yeah, yeah, yeah.
>> It's just that you then fit
a hand model to
the six recorded sensors.
>> Yeah, yeah. Then with
some inverse kinematics:
if we get this joint location
and this joint location,
and we have the rotation,
we actually can
calculate these parts.
>> So how worried are you about
your ground truth not modeling
individual finger flexions,
since these might look very similar
in your sensor data?
>> For the sensor data,
when you do this,
the rotation actually changes,
because it also records the rotation.
The sensor is very sensitive;
it can give you
the different rotations,
and then you can calculate
the other joints.
>> I'm not familiar: does
the sensor interfere with
the depth data at all?
>> We have the wires, but
the thing is
that the wires are
quite thin and the sensor
itself is thin.
In a depth camera,
in the depth image,
you can hardly see these wires,
but in RGB, you would
not use this data.
>> In RGB, you would
easily see them.
>> Yeah, yeah, yeah.
So my current work
is about transferring
from depth to RGB;
we will generate something,
like with some GAN or
something. Yeah.
>> Thank you very much.
>> Thank you very much.
