Dear Fellow Scholars, this is Two Minute Papers
with Dr. Károly Zsolnai-Fehér.
Approximately three years ago, a magical learning-based
algorithm appeared that could translate
a photorealistic image of a zebra into a horse,
or the other way around, transform apples
into oranges, and more.
Later, it became possible to do this even
without a photorealistic input image;
all we needed was a segmentation map.
This segmentation map provides labels on what
should go where, for instance, this should
be the road, here will be trees, traffic signs,
other vehicles, and so on.
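To make the idea concrete, here is a minimal sketch of what a segmentation map is: a grid of integer class labels, one per pixel. The class names and the tiny 4×6 layout are my own illustrative assumptions, not data from the paper.

```python
import numpy as np

# Hypothetical class IDs for a driving scene (illustrative only).
ROAD, TREE, SIGN, VEHICLE = 0, 1, 2, 3

# A tiny 4x6 segmentation map: each pixel stores a class label,
# not a color -- "this should be the road, here will be trees", etc.
seg_map = np.array([
    [TREE, TREE, SIGN, TREE, TREE,    TREE],
    [TREE, TREE, SIGN, TREE, TREE,    TREE],
    [ROAD, ROAD, ROAD, ROAD, VEHICLE, ROAD],
    [ROAD, ROAD, ROAD, ROAD, VEHICLE, ROAD],
])

# The generator's job is to turn this label grid into a
# photorealistic frame of the same height and width.
print(seg_map.shape)            # (4, 6)
print((seg_map == ROAD).sum())  # 10 road pixels
```

A real dataset would use many more classes and full-resolution label maps, but the data structure is just this: labels on what should go where.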
And the output was a hopefully photorealistic
video, and you can see here that the results
were absolutely jaw-dropping.
However, … look!
As time goes by, the back of the car morphs
and warps, creating unrealistic
results that are inconsistent even in the
short term.
In other words, things change around from
second to second, and the AI does not appear
to remember what it did just a moment ago.
This kind of consistency was solved surprisingly
well in a follow-up paper from NVIDIA, in which
an AI would watch footage of a video
game, for instance Pac-Man, and after watching
for approximately 120 hours, we could
shut down the video game, and the AI would
understand the rules so well that it could
recreate a playable version of the game.
It had memory and used it well, and therefore,
it could enforce a notion of world consistency,
or in other words, if we return to a state
of the game that we visited before, it will
remember to present us with very similar information.
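This notion of world consistency can be sketched with a toy memory: cache what was produced for each visited state, and serve it again when that state is revisited. A real system stores learned guidance information rather than literal strings, so treat the dictionary below purely as an analogy.

```python
# Toy sketch of world consistency: remember what was generated
# for each state, and reuse it when that state is revisited.
def make_world():
    memory = {}  # maps a state to what was generated for it

    def render(state):
        # state: any hashable description of where we are, e.g. (x, y)
        if state not in memory:
            memory[state] = f"frame generated for {state}"
        return memory[state]

    return render

render = make_world()
first = render((3, 7))   # visit a state for the first time
render((0, 0))           # look somewhere else
again = render((3, 7))   # return to the earlier state
print(first == again)    # True: the revisited state shows the same content
```

The point of the analogy is the cache hit: returning to a previously visited state presents very similar information instead of regenerating it from scratch.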
So, the question naturally arises, would it
be possible to create a photorealistic video
from these segmentation maps that is also
consistent?
And in today’s paper, researchers at NVIDIA
propose a new technique that requests some
additional information, for instance a depth
map, which tells us a little more about
how far different parts of the image are
from the camera.
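As a rough sketch, the conditioning input for one frame might then pair a label per pixel with a distance per pixel. The shapes, units, and names below are illustrative assumptions, not the paper's actual interface.

```python
import numpy as np

# Illustrative per-frame inputs: a label map and a depth map.
H, W = 4, 6
seg_map = np.zeros((H, W), dtype=np.int64)               # class label per pixel
depth_map = np.linspace(1.0, 50.0, H * W).reshape(H, W)  # distance from camera

# Stacking labels and depths gives the generator a richer description
# of the scene geometry than labels alone.
conditioning = np.stack([seg_map.astype(np.float64), depth_map])
print(conditioning.shape)  # (2, 4, 6): two channels, H rows, W columns
```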
Much like the Pacman paper, this also has
memory, and I wonder if it is able to use
it as well as that one did.
Let’s test it out.
This previous work is currently looking at
a man in a red shirt; we slowly look away,
disregard the warping, and when we go back…hey!
Do you see what I see here?
The shirt became white.
This is not because the person is one of those
artists who can change their clothes in less
than a second, but because this older technique
did not have a consistent internal model of
the world.
And now, let’s see the new one.
Once again, we start with the red shirt, look
away, and then…yes!
The same red-to-blue gradient.
Excellent!
So it appears that this new technique also
reuses information from previous frames efficiently,
and is therefore finally able to create
a consistent video with much less morphing
and warping. Even better, we get the
advantageous consistency property where, if
we look at something we looked at before,
we will see very similar information there.
But there is more.
Additionally, it can also generate scenes
from new viewpoints, which we also refer to
as neural rendering.
And as you see, the two viewpoints show similar
objects, so the consistency property holds
here too.
And now, hold on to your papers, because we
do not necessarily have to produce these semantic
maps ourselves.
We can let the machines do all the work by
firing up a video game that we like, requesting
that the different object classes be colored
differently, and getting this input for free.
And then, the technique generated a photorealistic
video from the game graphics.
Absolutely amazing.
Now, note that it is not perfect; for instance,
it has a different notion of time, as the clouds
in the background change rapidly.
And, look!
At the end of the sequence, we get back to
our starting point, and the first frame that
we saw is very similar to the last one.
The consistency works here too.
Very good.
I have no doubt that two more papers down
the line, this will be even better.
And for now, we can create consistent, photorealistic
videos even if all we have is freely obtained
video game data.
What a time to
be alive!
Thanks for watching and for your generous
support, and I'll see you next time!
