Have you ever noticed that it is hard to make
eye contact during a video call?
This is because people don't usually look
into the camera during a call.
Instead, they look at the other person's image
on their display or sometimes they even look
at their own preview image.
In a typical video conferencing setup, the
camera and the display are not aligned with
each other.
This camera/display/user geometry creates
a gaze disparity that makes it hard to maintain
eye contact and have a natural, face-to-face
conversation.
We propose an eye contact correction model
that restores eye contact regardless of
the relative position of the camera and display.
Whether your camera is located at the top
left, center, or right of your device,
our model redirects the gaze from an arbitrary
direction to the center, out of the box, without
any calibration.
Our system uses a deep convolutional neural
network that redirects the gaze by warping
and tuning the eyes in the input frames.
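To make the warping-and-tuning idea concrete, here is a minimal sketch in PyTorch of applying a predicted flow field and brightness map to an eye crop. The tensor layout, function name, and the separate flow/brightness outputs are illustrative assumptions, not our exact architecture:

```python
# A minimal sketch of the warp-and-tune idea, assuming a model that
# predicts a per-pixel flow field and a brightness adjustment map.
import torch
import torch.nn.functional as F

def warp_and_tune(eye_patch, flow, brightness):
    """Warp an eye patch with a predicted flow field, then adjust brightness.

    eye_patch:  (N, C, H, W) input crop around one eye
    flow:       (N, 2, H, W) horizontal/vertical displacement in pixels
    brightness: (N, 1, H, W) additive tuning map (e.g., for eyelid shading)
    """
    n, _, h, w = eye_patch.shape
    # Base sampling grid in normalized [-1, 1] coordinates.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=eye_patch.device),
        torch.linspace(-1, 1, w, device=eye_patch.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, -1, -1, -1)
    # Convert pixel displacements to normalized offsets and add to the grid.
    offset = torch.stack(
        (flow[:, 0] * 2 / (w - 1), flow[:, 1] * 2 / (h - 1)), dim=-1
    )
    warped = F.grid_sample(eye_patch, base + offset, align_corners=True)
    return (warped + brightness).clamp(0, 1)
```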
To prevent this model from modifying the input
images beyond repair, we trained it in a bidirectional
way that enforces mapping reversibility.
Our model simultaneously learned to move the
gaze in one direction and to revert it back
to its original state, reconstructing the
original input image.
This training setup reduced artifacts
and resulted in more natural gaze correction.
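Here is a minimal sketch of what such a bidirectional objective could look like, assuming paired training images and a hypothetical mode flag that switches the network between redirecting and reverting the gaze:

```python
# A minimal sketch of a bidirectional training objective: the network
# redirects the gaze, then maps the result back to reconstruct the input.
# The `mode` flag and the unweighted sum of the losses are hypothetical.
import torch.nn.functional as F

def bidirectional_loss(net, x_src, x_dst):
    y_fwd = net(x_src, mode="redirect")  # move gaze toward the target
    y_cyc = net(y_fwd, mode="revert")    # map it back again
    loss_fwd = F.l1_loss(y_fwd, x_dst)   # match the target-gaze image
    loss_cyc = F.l1_loss(y_cyc, x_src)   # reconstruct the original input
    return loss_fwd + loss_cyc
```

The cycle term is what enforces reversibility: any detail the forward pass destroys cannot be recovered on the way back, so the network is pushed toward conservative, invertible edits.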
To train our model, we used millions of image
pairs where a subject looked in two different
directions in each pair.
So, where did we get all this data from?
We generated a large portion of it synthetically.
We used a generative adversarial network to
refine the synthetic samples in our dataset.
The refined synthetic samples looked virtually
indistinguishable from the natural ones.
Being able to generate photorealistic synthetic
data allowed us to produce an immense amount
of perfectly labeled data at minimal cost.
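A well-known recipe for this kind of refinement is SimGAN (Shrivastava et al., 2017), which combines an adversarial loss with a self-regularization term so that the gaze labels survive refinement. The sketch below follows that general recipe; the refiner, discriminator, and weight `lam` are illustrative assumptions rather than our exact setup:

```python
# A minimal sketch of SimGAN-style refinement: make synthetic eye images
# look real while keeping them close enough to the input that the gaze
# labels remain valid. `refiner`, `discriminator`, `lam` are hypothetical.
import torch
import torch.nn.functional as F

def refiner_loss(refiner, discriminator, synthetic, lam=0.1):
    refined = refiner(synthetic)
    # Adversarial term: refined samples should be classified as real.
    d_out = discriminator(refined)
    adv = F.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out))
    # Self-regularization: stay close to the input so labels still hold.
    reg = F.l1_loss(refined, synthetic)
    return adv + lam * reg
```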
An interesting phenomenon we observed is that
our model learned to predict the input gaze,
as a side product, at no additional cost.
In this demo video, for example, the circle
shows where the model thinks the user
is looking.
This behavior is likely a byproduct of training
the model without explicitly providing a redirection
angle.
The model simply infers the input gaze to
be able to move it to the center.
We didn't train this model to be a full-blown
gaze predictor, so it shouldn't be expected
to be as accurate as systems designed
specifically for that purpose.
It's an interesting observation, though, that
our model learned the input gaze implicitly
to function better as an eye contact corrector.
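Purely as an illustration of how such an implicit estimate could be read out of a warping model: if the correction must move the iris to the center, the average predicted displacement over the eye region roughly encodes the input gaze offset. The function below is a hypothetical sketch, not a description of our implementation:

```python
# Hypothetical readout of an implicit gaze estimate from the flow field.
import torch

def implicit_gaze_estimate(flow, eye_mask):
    """flow: (2, H, W) predicted displacements; eye_mask: (H, W) in {0, 1}."""
    weights = eye_mask / eye_mask.sum().clamp(min=1)
    dx = (flow[0] * weights).sum()
    dy = (flow[1] * weights).sum()
    # The correction moves the gaze by (dx, dy), so the input gaze was
    # offset by roughly (-dx, -dy) relative to the camera direction.
    return -dx.item(), -dy.item()
```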
We use a set of control mechanisms that ensure
a smooth and natural video conferencing experience.
These mechanisms adjust the strength of the
correction, prevent the creepiness of overly
corrected eye contact, and ensure temporal
consistency in live applications.
For example, we smoothly disable correction
when the user is blinking or clearly looking
somewhere other than the camera and display.
Overall, the control mechanisms prevent abrupt
changes and ensure that the eye contact corrector
avoids any unnatural correction.
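As an illustration, here is a minimal sketch of the kind of per-frame control logic described above; the signals, thresholds, and smoothing factor are hypothetical assumptions:

```python
# A minimal sketch of per-frame control logic for the correction strength.
def update_strength(prev_strength, is_blinking, gaze_offset_deg,
                    max_offset_deg=10.0, alpha=0.2):
    """Exponentially smooth the correction strength between frames."""
    # Target strength: fully on when the user plausibly looks at the
    # display, fading to zero when blinking or looking far away.
    if is_blinking or gaze_offset_deg > max_offset_deg:
        target = 0.0
    else:
        target = 1.0 - gaze_offset_deg / max_offset_deg
    # Low-pass filter prevents abrupt on/off transitions between frames.
    return (1 - alpha) * prev_strength + alpha * target
```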
Our eye contact corrector runs in real time
and supports a wide variety of video-conferencing-capable
devices with different display
sizes and camera placements.
Our model preserves details, such as glasses
and eyelashes, without hallucinating details
that do not exist in the input.
We built this system primarily to improve
the quality of the video conferencing experience.
However, it can be used in other cases too.
For example, personal broadcasters can read
a script from the display of their device
while maintaining eye contact with their viewers.
Take a look at our paper to learn more about
how our system works.
