Dear Fellow Scholars, this is Two Minute Papers
with Károly Zsolnai-Fehér.
This work is about creating an AI that can
perform audio-visual correspondence.
This means two really cool tasks:
One, when given a piece of video and audio,
it can guess whether they match each other.
And two, it can localize the source of the
sounds heard in the video.
Hm-hmm!
And wait, because this gets even better!
As opposed to previous works, here, the entire
network is trained from scratch and is able
to perform cross-modal retrieval.
Cross-modal retrieval means that we are able
to give it an input sound and it will be able
to find pictures that would produce similar
sounds.
Or vice versa.
For instance, here, the input is the sound
of a guitar, note the loudspeaker icon in
the corner, and it shows us a bunch of either
images or sounds that are similar.
Marvelous.
The training is unsupervised, which means
that the algorithm is given a bunch of data
and learns without additional labels or instructions.
The architecture and results are compared
to a previous work by the name Look, Listen
& Learn that we covered earlier in the series,
the link is available in the video description.
As you can see, both of them run a convolutional
neural network.
This is one of my favorite parts about deep
learning - the very same algorithm is able
to process and understand signals of very
different kinds: video and audio.
The old work concatenates this information
and produces a binary yes/no decision whether
it thinks the two streams match.
This new work instead tries to produce a number that
encodes the distance between the video and
the audio.
Kind of like the distance between two countries
on a map, but both video and audio signals
are embedded in the same map.
And the output decision always depends on
how small or big this distance is.
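To make this concrete, here is a minimal sketch of the idea, not the paper's actual implementation: assume two hypothetical encoders have already produced a video embedding and an audio embedding as vectors in the same shared space, and the match decision is just a threshold on their distance.

```python
import numpy as np

def embed_distance(video_vec, audio_vec):
    """Euclidean distance between unit-normalized embeddings.

    A small distance means the network thinks the video
    and the audio belong together."""
    v = video_vec / np.linalg.norm(video_vec)
    a = audio_vec / np.linalg.norm(audio_vec)
    return float(np.linalg.norm(v - a))

def is_match(video_vec, audio_vec, threshold=1.0):
    # The binary yes/no decision falls out of the distance:
    # below the threshold -> match, above -> mismatch.
    # (The threshold value here is an illustrative assumption.)
    return embed_distance(video_vec, audio_vec) < threshold
```

Identical embeddings give a distance of zero (a match), while very different ones land far apart on the shared "map".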
This distance metric is quite useful: if we
have an input video or audio signal, choosing
other video and audio snippets that have a
low distance is one of the important steps
that opens up the door to this magical cross-modal
retrieval.
What a time to be alive!
Some results are very easy to verify, while
others may spark some more debate.
For instance, it is quite interesting to see
that the algorithm highlights the entirety
of the guitar string as a sound source.
If you are curious about this mysterious blue
image here, make sure to have a look at the
paper for an explanation.
Now this is a story that we would like to
tell to as many people as possible.
Everyone needs to hear about this.
If you would like to help us with our quest,
please consider supporting us on Patreon.
You can also pick up some cool perks, like
getting early access to these videos or deciding
the order of upcoming episodes.
Details are available in the video description.
Thanks for watching and for your generous
support, and I'll see you next time!
