Dear Fellow Scholars, this is Two Minute Papers
with Károly Zsolnai-Fehér.
In a world where learning-based algorithms
are rapidly becoming more capable, I increasingly
find myself asking the question: “so, how
smart are these algorithms, really?”.
I am clearly not alone in this.
To answer this question, a set
of tests has been proposed, and many of these
tests share one important design decision:
they are very difficult to solve for someone
without generalized knowledge.
In an earlier episode, we talked about DeepMind’s
paper where they created a bunch of randomized,
mind-bending, or, in the case of an AI, maybe
silicon-bending questions that looked quite
a bit like a nasty, nasty IQ test.
And even in the presence of additional distractions,
their AI did extremely well.
I noted that on this test, finding the correct
solution around 60% of the time would be quite
respectable for a human, while their algorithm
succeeded over 62% of the time, and upon removing
the annoying distractions, this success rate
skyrocketed to 78%.
Wow.
More specialized tests have also been developed.
For instance, scientists at DeepMind also
released a modular math test with over 2 million
questions, in which their AI did extremely
well at tasks like interpolation and rounding
decimals and integers, whereas it was not
too accurate at detecting primality and at factorization.
Furthermore, a little more than a year ago,
the GLUE benchmark appeared, which was designed
to test the natural language understanding
capabilities of these AIs.
When benchmarking the state-of-the-art learning
algorithms, the authors found that these algorithms
were approximately 80% as good as their fellow
non-expert human beings.
That is remarkable.
Given the difficulty of the test, they were
likely not expecting human-level performance,
which you see marked with the black horizontal
line, to be surpassed within less than
a year. And yet it was.
So, what do we do in this case?
Well, as always, of course, design an even
harder test.
In comes SuperGLUE, the paper we’re looking
at today, which is meant to provide an even
harder challenge for these learning algorithms.
Have a look at these example questions here.
For instance, this time around, the questions
place more emphasis on using general
background knowledge.
As a result, the AI has to be able to learn
and reason with more finesse to successfully
answer these questions.
Here you see a bunch of examples, and you
can see that these are anything but trivial
little tests for a baby AI - not all, but
some of these are calibrated for humans at
around college-level education.
So, let’s have a look at how the current
state-of-the-art AIs fared in this one!
Well, not as well as humans, which is good
news, because that was the main objective.
However, they still did remarkably well.
For instance, the BoolQ package contains a
set of yes-and-no questions; on these, the
AIs are reasonably close to human performance.
On MultiRC, the multi-sentence reading
comprehension package, they still do OK, but
humans outperform them by quite a bit.
Note that you see two numbers for this test;
the reason is that there are multiple
test sets for this package.
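To make the task format concrete, here is a minimal sketch of what a BoolQ-style item looks like, along with a trivial majority-class baseline of the kind such benchmarks are scored against. The passage and question texts below are illustrative placeholders, not items from the actual dataset, and the `BoolQItem` class is a hypothetical structure of my own, not the benchmark's API.

```python
# A BoolQ-style item: a short passage, a yes/no question about it,
# and a boolean label. Example content below is illustrative only.
from dataclasses import dataclass

@dataclass
class BoolQItem:
    passage: str
    question: str
    answer: bool  # True = "yes", False = "no"

items = [
    BoolQItem(
        passage="The GLUE benchmark tests natural language understanding; "
                "SuperGLUE was later introduced as a harder successor.",
        question="is superglue harder than glue",
        answer=True,
    ),
    BoolQItem(
        passage="MultiRC is a multi-sentence reading comprehension task "
                "where each question can have several correct answers.",
        question="does multirc use only single-sentence passages",
        answer=False,
    ),
]

def majority_baseline(items):
    """A trivial baseline: always predict the most common label.

    Learned models are interesting only insofar as they beat
    floors like this one."""
    yes_count = sum(item.answer for item in items)
    prediction = yes_count >= len(items) - yes_count  # majority label
    correct = sum(item.answer == prediction for item in items)
    return correct / len(items)

print(f"Majority-class accuracy: {majority_baseline(items):.2f}")
```

With one "yes" and one "no" item, the baseline can only get half of them right, which is exactly why a model close to human performance on BoolQ is noteworthy.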
Note that on the second one, even humans seem
to fail almost half the time, so I can only
imagine the revelations we’ll have a couple
more papers down the line.
I am very excited to see that, and if you
are too, make sure to subscribe and hit the
bell icon to not miss future episodes.
Thanks for watching and for your generous
support, and I'll see you next time!
