Dear Fellow Scholars, this is Two Minute Papers
with Károly Zsolnai-Fehér.
With today's camera and graphics technology,
we can enjoy smooth and creamy videos on our
devices that were created at 60 frames per
second.
I also make each of these videos at 60 frames
per second; however, I almost always encounter
paper videos that run at anything from 24 to
30 frames per second.
In this case, I put them on my video editor's
60 fps timeline, so half or even more of the
timeline's frames provide no new information.
If we then try to slow the videos down for
some nice slow-motion action, this ratio gets
even worse, creating an extremely choppy output
video because we have huge gaps between these
frames.
So, does this mean that there is nothing we
can do and have to put up with this choppy
footage?
No, not at all!
Earlier, we discussed two potential techniques
to remedy this issue.
One was frame blending, which simply computes
the average of two consecutive images and
presents that as the in-between frame.
This helps a little for simpler cases, but
this technique is unable to produce new information.
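For our more hands-on Fellow Scholars, here is a minimal sketch of this blending in Python with OpenCV; the file names are placeholders I made up for illustration.

```python
import cv2
import numpy as np

# Minimal frame-blending sketch: the "new" in-between frame is just the
# per-pixel average of its two neighbors. File names are placeholders.
frame_a = cv2.imread("frame_0000.png").astype(np.float32)
frame_b = cv2.imread("frame_0001.png").astype(np.float32)

# A 50/50 blend: no new information is created, only a crossfade.
blended = (0.5 * frame_a + 0.5 * frame_b).astype(np.uint8)
cv2.imwrite("frame_0000_5.png", blended)
```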
Optical flow is a much more sophisticated
method that is very capable as it tries to
predict the motion that takes place between
these frames.
This can sort of produce new information, and
I use it in the series on a regular basis,
but the output footage also has to be carefully
inspected for unwanted artifacts, which are a
relatively common occurrence.
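And here is a rough sketch of the optical flow idea, using OpenCV's classical Farnebäck flow estimator rather than the paper's learned method; the halfway warp is a crude approximation and, true to form, it can produce exactly the kinds of artifacts mentioned above.

```python
import cv2
import numpy as np

# Rough optical flow interpolation sketch using OpenCV's classical
# Farnebäck estimator (not the paper's learned method).
frame_a = cv2.imread("frame_0000.png")
frame_b = cv2.imread("frame_0001.png")
gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)

# Dense flow from frame_a to frame_b.
flow = cv2.calcOpticalFlowFarneback(gray_a, gray_b, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)

# Backward-warp frame_a halfway along the flow: a crude approximation
# of the frame at t = 0.5, prone to artifacts around occlusions.
h, w = gray_a.shape
grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
map_x = (grid_x - 0.5 * flow[..., 0]).astype(np.float32)
map_y = (grid_y - 0.5 * flow[..., 1]).astype(np.float32)
midpoint = cv2.remap(frame_a, map_x, map_y, cv2.INTER_LINEAR)
cv2.imwrite("frame_0000_5.png", midpoint)
```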
Now, our seasoned Fellow Scholars will immediately
note that we have a lot of high-framerate
videos on the internet, so why not delete some
of the in-between frames, give both the choppy
and the smooth videos to a neural network,
and teach it to fill in the gaps?
After the lengthy training process, it should
be able to complete these choppy videos properly.
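Here is a minimal sketch of how such training pairs could be assembled; this is my assumption of the general recipe, not the authors' exact data pipeline.

```python
import cv2

# Assumed recipe (not the authors' exact pipeline): slice a
# high-framerate video into overlapping frame triplets, hide the middle
# frame, and keep it as the ground truth the network must reconstruct.
def make_triplets(video_path):
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    # (previous frame, hidden ground-truth middle, next frame)
    return [(frames[i], frames[i + 1], frames[i + 2])
            for i in range(len(frames) - 2)]
```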
So, is that true?
Yes, but note that there are plenty of techniques
out there that already do this, so what is
new in this paper?
Well, this work does that… and much more!
We will have a look at the results, which
are absolutely incredible, but to be able
to appreciate what is going on, let me quickly
show you this.
The design of this neural network produces
four different kinds of data to fill in these
missing frames.
One is optical flow, which was part of previous
solutions too, but two, it also produces a
depth map that tells us how far different
parts of the image are from the camera.
This is of utmost importance, because if we
rotate the camera around, previously occluded
objects suddenly become visible, and we need
proper intelligence to be able to recognize
this and to fill in this kind of missing information.
Three, this is what the contextual extraction
step is for, which drastically improves the
quality of the reconstruction. And four, the
interpolation kernels are also learned, which
gives the network more knowledge about what
data to take from the previous and the next frame.
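To make the depth idea a little more concrete, here is a hedged sketch of how inverse-depth weighting could resolve occlusion conflicts; this is a simplification I wrote for illustration, not the authors' exact flow projection layer.

```python
import numpy as np

# Simplified illustration (not the paper's exact layer): when several
# flow vectors land on the same in-between pixel, weight them by
# inverse depth, so closer objects win the occlusion conflict.
def depth_weighted_flow(candidate_flows, candidate_depths):
    flows = np.asarray(candidate_flows, dtype=np.float64)    # (N, 2)
    depths = np.asarray(candidate_depths, dtype=np.float64)  # (N,)
    weights = 1.0 / (depths + 1e-6)  # smaller depth -> larger weight
    weights /= weights.sum()
    return (weights[:, None] * flows).sum(axis=0)
```

With this weighting, a foreground object at depth 1 dominates a background object at depth 10, which matches our intuition about what should stay visible when things overlap.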
Since it also has a contextual understanding
of these images, one would think that it needs
a ton of neighboring frames to understand
what is going on, which, surprisingly, is
not the case at all!
All it needs is the two neighboring images.
So, after doing all this work, it better be
worth it, right?
Let’s have a look at some results!
Hold on to your papers, and in the meantime,
look at how smooth and creamy the outputs
are!
Love it!
Because it also deals with contextual information,
if you wish to feel like a real Scholar, you
can gaze at regions where the occlusion situation
changes rapidly and see how well it fills
in this kind of information.
Unreal.
So how does one show that the technique is
quite robust?
Well, by running it on tons and tons of footage
and showing off the results - and that is exactly
what the authors did!
I put a link to a huge playlist with 33 different
videos in the description so you can have
a look at how well this works on a wide variety
of genres.
Now, of course, this is not the first technique
for learning-based frame interpolation, so
let’s see how it stacks up against the competition!
Wow, this is quite a value proposition, because
depending on the dataset, it comes in first
or second place on most examples.
PSNR is the peak signal-to-noise ratio, while
SSIM is the structural similarity index, both
of which measure how well the algorithm
reconstructs these details compared to the
ground truth, and for both, higher is better.
Note that neither of them is linear, so even
a small difference in these numbers can mean
a significant difference in quality.
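For reference, PSNR is a logarithmic measure, which is why small numeric gaps matter; here is the standard formula for 8-bit images, sketched in Python.

```python
import numpy as np

# Standard PSNR for 8-bit images: 10 * log10(MAX^2 / MSE).
# It is logarithmic, so a 1 dB gap is already a real quality difference.
def psnr(reference, reconstruction):
    diff = reference.astype(np.float64) - reconstruction.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(255.0 ** 2 / mse)
```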
I think we are now at a point where these
learning-based tools are getting so much better
than their handcrafted optical flow rivals
that they will quickly find their way into
production software.
I cannot wait.
What a time to be alive!
This episode has been supported by Weights
& Biases.
In this post, they show you which hyperparameters
to tweak to improve your model performance.
Weights & Biases provides tools to track your
experiments in your deep learning projects.
Their system is designed to save you a ton
of time and money, and it is actively used
in projects at prestigious labs, such as OpenAI,
Toyota Research, GitHub, and more.
They don’t lock you in, and if you are an
academic or have an open source project, you
can use their tools for free.
It really is as good as it gets.
Make sure to visit them through wandb.com/papers
or just click the link in the video description
and you can get a free demo today.
Our thanks to Weights & Biases for their long-standing
support and for helping us make better videos
for you.
Thanks for watching and for your generous
support, and I'll see you next time!
