[MUSIC PLAYING]
JOSH MCDERMOTT: Artificial neural networks have recently emerged as leading models of sensory systems. When appropriately optimized, these models perform tasks like speech recognition and object classification about as well as humans and exhibit similar patterns of behavior. And the feature spaces that they learn can be used to predict brain activity substantially better than previous models.
JENELLE FEATHER: A critical part of the representational power of contemporary neural networks is their invariance. They instantiate nonlinear functions that map many distinct stimulus examples onto the same category, thus achieving robust recognition abilities. In this work, we investigated whether the invariances learned by deep neural networks actually match human perceptual invariances. We found sets of stimuli that the network said were the same, and asked whether humans also perceive these stimuli as the same. We called these stimuli model metamers: stimuli that are physically distinct but perceived as the same by a model.
JOSH MCDERMOTT: So the basic logic is simple. If we have a good model of some aspect of perception, say, speech recognition, then if we pick two sounds that the model judges to be the same, a human listener presented with those two sounds should also judge them to be the same. If, instead, they judge them to be different, that indicates a clear difference between the representations in the model and those in human perception.
JENELLE FEATHER: In our paper, we evaluated both visual and auditory neural networks. Model metamers are generated by first measuring the model activations for a particular natural stimulus, such as an image or sound. We then take a noise stimulus and use optimization tools to modify this noise input until, eventually, the activations for the noise match those of the natural stimulus. We then consider this optimized noise to be our model metamer. The model metamer is also assigned to the same category as the natural stimulus, even though the input can be very different from the original.
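The generation procedure described here can be sketched as a gradient-based activation-matching loop. This is a minimal illustration assuming a toy one-stage "model" (fixed random weights plus a softplus nonlinearity); the model, step size, and iteration count are stand-ins, not the architectures or optimizer used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model stage: a fixed random linear map (64-d input -> 32-d activations)
# followed by a softplus nonlinearity. Purely illustrative.
W = rng.standard_normal((32, 64)) * 0.05

def softplus(z):
    return np.log1p(np.exp(z))

def activations(x):
    return softplus(W @ x)

natural = rng.standard_normal(64)   # stand-in for a natural stimulus
target = activations(natural)       # activations the metamer must match

x = rng.standard_normal(64)         # start from a noise stimulus
lr = 2.0
for _ in range(10000):
    pre = W @ x
    err = softplus(pre) - target                 # residual of 0.5 * ||a(x) - a(natural)||^2
    grad = W.T @ (err / (1.0 + np.exp(-pre)))    # chain rule through softplus
    x -= lr * grad                               # gradient step on the noise input

metamer = x
# The metamer's activations now match the natural stimulus's, but because
# the mapping is many-to-one (32 constraints on a 64-d input), the metamer
# itself remains far from the natural input.
```

The key property the sketch illustrates: the optimization only constrains the activations, so any input direction the model is invariant to is left at its random initial value, which is exactly why a metamer can look or sound nothing like the original.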
JOSH MCDERMOTT: Now, previous work has used related optimization tools to invert neural network representations, but has always used priors that constrain the resulting signals to be naturalistic. So from the standpoint of using this as a tool to evaluate these models as models of human perception, those priors could actually mask differences that might exist between the model and the human representations. And so when we generated model metamers, they were constrained only by the activations in the neural network.
JENELLE FEATHER: We generated these model metamers for three different image-trained architectures. You can see, from looking at the image demos, that the model metamers generated from the late model stages are completely unrecognizable for all of the tested models. To quantify the extent to which the model representations actually match those of human perception, we ran a human behavioral experiment where participants had to classify the natural image and the model metamers. As you may expect from having seen the example images, humans can recognize the model metamer when it is matched to early stages of the network. But they're completely unable to recognize it when it is matched to the late model stages. We saw the same trend in networks trained to recognize speech. Model metamers matched to late stages of the network are unrecognizable.
[STATIC]
Not only do listeners get the word incorrect, but the metamer doesn't sound like speech at all, further suggesting that the network representations don't line up with human representations.
JOSH MCDERMOTT: Having obtained this result, we dug a bit deeper to try to understand the origins of the model metamer failures for our audio-trained networks. We explored the effect of the particular task the models were trained on and of the model architecture, and we found some modifications that increased the recognizability of the model metamers to humans. This gives us some hope that we may eventually be able to develop models that pass the metamer test and that thus better capture the invariances of human perception.
JENELLE FEATHER: Model metamers demonstrate a significant failure of present-day neural networks to match the invariances in the human visual and auditory systems. We hope that this work will provide a useful behavioral measuring stick to improve model representations and create better models of the human sensory systems.
[MUSIC PLAYING]
