[MUSIC]
>> Today, we're lucky to have
Paul Smolensky here to give
a presentation for us.
Paul is a professor of cognitive
science at Johns Hopkins,
and also a partner researcher
in the deep learning
team here at MSR AI.
So as such, he's perfectly suited
to give us this nice overview
of the history of thoughts of
human cognition and approaches,
and how the pendulum
has moved between
symbolic and neural,
and back and forth.
I know his current work is on
looking at how you can merge them
to create these neural
symbolic approaches
that hopefully take the
best of both worlds.
I assume we'll learn a
bit more about as well,
but with that said, Paul,
why don't you go and take it
away. You're muted right now.
>> Are you seeing my screen now?
>> Yeah. We can see your
screen and hear you.
>> Okay. Thanks a lot.
Let's get started then.
In the beginning, models of intelligence used representations that were discrete symbol structures, and manipulated them with discrete rewrite rules like those of logic or grammar.
But this wasn't invented
in the 20th century,
it was invented around 400 BCE,
as we'll see soon.
But anyway, in the mid 1980s,
the field took a huge swing.
The Parallel Distributed
Processing group,
among others, showed
how numerical vectors,
listing the activation values
of groups of model neurons,
could be used as powerful
ways of representing
information for
intelligent processing.
The 1990s saw another swing back
to symbolic representations,
which were now assigned
probabilities rather than
manipulated by rules.
In the 2000s, this pendulum
swung again back to
numerical vector representations
designed from data by
deep learning. What's going on here? Can't the field make up its mind on such a fundamental distinction between using symbolic computation or neural computation?
To resolve this, I propose we follow
advice from my mother,
who used to say,
allegedly quoting Yogi Berra,
"When you get to the fork
in the road, take it."
Clearly both types of
computation are essential,
and this is the moral
of these lectures.
The pendulum will finally stop
swinging only when the field
rises up and adopts
neurosymbolic computation.
To really understand
what's going on here,
I think we need to take a
historical perspective.
AI inherits much of
its intellectual technology from
centuries of the study
of human intelligence,
which in the recent era goes
by the name cognitive science.
Cognitive science in
turn inherits much
of its intellectual technology from
computer science by adopting
the view that what we
have in our heads is
really a computer.
But what type of computer
do we have in our heads?
One built with symbolic
computation or neural computation?
In this and next
week's presentations,
I'm not wearing my MSR
AI researcher hat,
but my professor of cog sci hat.
When I told Henry [inaudible]
about my favorite course,
foundations of cognitive science,
he asked if I'd consider giving
a two-hour synopsis of it here.
Unfortunately, the course
is entirely based on
class discussion of over 30
readings and there are no lectures.
I've created these lectures just for
the occasion and they have
not been field tested.
So please be kind and keep in
mind that I'm trying to compress
32 hours of class
time into two hours.
This requires
oversimplifying and skipping
lightly over many deep
and complex issues.
So please bear with me and
take these lectures as
a view of this complex
material from outer space.
It would be helpful if you
submitted your questions by
chat or held them to
the question period.
Before proceeding, I want to
emphasize that the history
of cognitive science,
or of AI, is not for me a field
of research or expertise.
My goal here is only to put a lot
of ideas on the table for all of us,
especially interns new to the field,
to think about and perhaps go
investigate later in some detail.
I'll raise a lot of questions,
give a lot of proposed answers,
but not favor any particular answers,
except maybe at the
very end next week.
Why do I think it is crucial to
develop neurosymbolic
architectures for AI?
To get started in the cognitive
science perspective on this question,
let's start with a very
concrete case study,
learning word formation, specifically
the English past tense
and the German plural.
Debates about this in
cog sci have been raging
on and off since the
birth of the field,
and continue to this day,
the most recent round
laying out in ACL.
In a 2018 paper in the Transactions of the Association for Computational Linguistics, Chris Kirov and Ryan Cotterell built
a modern-day LSTM sequence to
sequence model to map an English
verb stem to its past tense form.
This includes regular
verbs like pick,
and pig, and pit.
Now, the past tense of
all these is written ed,
but pronounced in three
different ways as shown
here, picked, pigged, pitted.
The task also includes
irregular verbs that
don't follow these rules,
like come goes to came.
Kirov and Cotterell
argued that their model
provides a good account of
human performance on this task,
succeeding where
first-generation models
failed at correctly producing
irregular past tenses.
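To make the modeling setup concrete, here is a minimal character-level sequence-to-sequence sketch of the task in PyTorch. The toy data, phonetic respellings, and hyperparameters are my own illustrative assumptions, not the architecture or data of Kirov and Cotterell's actual model.

```python
# Minimal sketch of a stem -> past-tense seq2seq model (illustrative only).
import torch
import torch.nn as nn

# Rough phonetic respellings of picked, pigged, pitted, came (toy data).
pairs = [("pick", "pikt"), ("pig", "pigd"), ("pit", "pitid"), ("come", "came")]
chars = ["<s>", "</s>"] + sorted({c for s, t in pairs for c in s + t})
ix = {c: i for i, c in enumerate(chars)}
bos, eos = ix["<s>"], ix["</s>"]

def encode(word):
    return torch.tensor([ix[c] for c in word])

class Seq2Seq(nn.Module):
    def __init__(self, n_chars, dim=64):
        super().__init__()
        self.emb = nn.Embedding(n_chars, dim)
        self.enc = nn.LSTM(dim, dim, batch_first=True)
        self.dec = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, n_chars)

    def forward(self, src, tgt_in):
        # Encode the stem; the final LSTM state seeds the decoder.
        _, state = self.enc(self.emb(src).unsqueeze(0))
        dec_out, _ = self.dec(self.emb(tgt_in).unsqueeze(0), state)
        return self.out(dec_out.squeeze(0))

model = Seq2Seq(len(chars))
opt = torch.optim.Adam(model.parameters(), lr=5e-3)
loss_fn = nn.CrossEntropyLoss()
for _ in range(200):  # memorize the four toy pairs
    for stem, past in pairs:
        tgt = encode(past)
        tgt_in = torch.cat([torch.tensor([bos]), tgt])
        tgt_out = torch.cat([tgt, torch.tensor([eos])])
        loss = loss_fn(model(encode(stem), tgt_in), tgt_out)
        opt.zero_grad()
        loss.backward()
        opt.step()

def past_tense(stem, max_len=10):
    # Greedy decoding of a past tense for a (possibly novel) stem.
    with torch.no_grad():
        _, state = model.enc(model.emb(encode(stem)).unsqueeze(0))
        tok, out = torch.tensor([bos]), []
        for _ in range(max_len):
            dec, state = model.dec(model.emb(tok).unsqueeze(0), state)
            tok = model.out(dec.squeeze(0)).argmax(-1)
            if tok.item() == eos:
                break
            out.append(chars[tok.item()])
        return "".join(out)

print(past_tense("pick"))  # ideally "pikt" after training
print(past_tense("dig"))   # a novel stem probes generalization
```

The debate below is precisely about whether decoding novel stems like the last line generalizes the way human speakers do.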
Then at ACL 2019, Maria Corkery,
Yevgen Matusevych, and Sharon
Goldwater argued that,
well, actually it doesn't.
For made-up words,
production probabilities of
alternative past tense forms do
not correlate well with humans.
Just this month, Kate
McCurdy, Goldwater,
and Adam Lopez built a
corresponding model of the German
plural and argued that it too
failed to learn like humans.
Now, this big debate
dates back to at least
the 1960s when Susan Ervin
documented that children first
successfully produce came,
then started to say comed.
She argued that because
they've never heard comed,
they must have generated it.
Clearly, they learned the
regular rule and were
now mistakenly applying it to come.
In this era, learning language
meant learning rules.
But one of the forces that
swung the pendulum over to
neural representations
was a chapter in
what became the bible of a new generation of neural network modelers, the PDP books.
Dave Rumelhart and Jay McClelland argued that
a simple neural network model
shows the same learning
trajectory as kids,
even though it is obviously not
learning any symbolic
linguistic rules.
Now, language learning became learning the statistics of data from the training environment.
In the legendary reply,
Steven Pinker and Alan
Prince argued that
the PDP neural models suffered
from a dozen serious flaws.
They could all be
blamed specifically on
the absence of symbolic
representations and rules.
They advocated a dual
view in which kids have
one system that learns a symbolic
rule for the regular cases,
and a separate neural network
system for the exceptions.
Another player in this
debate was Gary Marcus,
who you may know as a contemporary
critic of deep-learning.
Marcus's career began with
this follow-up study
with Pinker and others,
arguing that the only reason the PDP model learned anything even remotely resembling the regular past-tense rule of English is that it is an extremely frequent pattern. Where the regular rule is very infrequent, as they argued was true for the German plural, PDP models would have no hope of success.
This year's ACL paper by
McCurdy et al essentially
argues that Marcus, Pinker,
and colleagues were right,
not just about 1980s PDP models,
but even 2020 deep-learning models.
The debate over whether
language learning meant
learning symbolic rules,
or learning statistics of the training data,
raged vigorously for years.
Here are 14 examples of some of
the subsequent modeling papers
addressing this debate.
Now, from the very concrete
case of English past tense,
let's zoom way out on
the debate over whether
human cognition is symbolic,
or neural computation,
and take advantage of a very
helpful geographic metaphor here,
that was illustrated in a figure
in a paper of Dan Dennett's,
which I've augmented a bit.
At the center of this plot is the
eminent philosopher Jerry Fodor,
who, when at MIT, called approaches to cognitive science "West Coast," except those at MIT, which formed the East Pole.
Every direction away
from MIT goes West.
The East pole view is
anchored in philosophy by
Fodor and linguistics
by Noam Chomsky,
and in psychology by Liz Spelke,
and Sue Carey, and Steven Pinker,
all having spent key parts
of their careers at MIT.
The East Pole is quite unified,
well-defined, and coherent.
We'll start by focusing on this view.
In contrast, spread around
the West Coast and filling
in the Midwest are
a huge variety of views
that have nothing in common
except that they all disagree with the East Pole, each in a different way.
Time permitting, I'll mention the green names I've added, as well as those in red; the originals are in black.
Either way, there's still a correlation between holding East Pole views and actually working in the Eastern US,
but the geography here is intended
to be purely metaphorical.
In this slide, I'm going to give
a crude timeline of some
of the key developments,
most of which I'll be
explaining shortly.
Please bear with me as I
run through this quickly.
The East Pole view in
linguistics was articulated by
Chomsky in the '60s and then
experimentally supported
by psychologists.
The authors shown in red
on this slide will be
briefly discussed in the story
I'll be telling in these lectures.
For his views on the cognitive
universality of grammar,
Chomsky credited the work of
the Port-Royal Grammarians in
France, three centuries earlier.
He called his approach
Cartesian Linguistics
because of its roots in the
work of Rene Descartes.
We will start the story of the so-called rationalist tradition with the work of Plato in Greece, 2,000 years earlier.
But as we're about to see, rationalist thought was not just a Western construct,
and although I unfortunately
won't discuss them much,
I just wanted to mention important
non-European traditions too.
Now, while Chomsky's own
work focused on syntax,
his work with Morris Halle founded the modern era of phonology,
the grammar of sound
structure in language.
I mentioned this because
this approach was in fact
pioneered in ancient India by Panini.
Panini's formal rule-based
grammar of Classical Sanskrit,
is still regarded by many as
the most complete generative
grammar ever developed.
Meanwhile in the AI
sector of the East Pole,
John McCarthy's logic School of AI,
was the culmination of millennia
of formal work on logic,
including many important
developments such as
the Port-Royal Logic and
even Panini's Logic,
which he developed for
his grammatical work.
Other non-Western founders of
formal logic in the ancient world
included Gautama in
India and Mozi in China.
Back in Greece, hugely
influential in the West
for his work on logic,
but also many other areas,
was Plato's student, Aristotle,
whose views became undeniable truths in Europe for nearly 2,000 years.
Likewise, the roots of
many West Coast views
go back centuries as well.
The PDP approach to neural
models of intelligence,
also known as connectionism,
follows philosophical
empiricism, developed in
the 17th and 18th centuries by British empiricists such as John Locke and David Hume.
Connectionism also draws from
more recent work in neuroscience,
especially the work of
Donald Hebb in the 1950s.
Opposed to Chomsky in linguistics, cognitive linguistics draws from the work of experimental cognitive psychologists such as Eleanor Rosch.
Cognitive psychology
was itself a reaction
against another West
Coast view, Behaviorism,
in which B. F. Skinner claimed that
human cognition, even language,
could be explained using the
concepts he developed to explain
the behavior of pigeons in
simple conditioning experiments.
Finally, we zoom out in
cognitive theory space,
which is defined by many dimensions
along which theories differ.
We just implicitly discussed one dimension: does the theory follow the empiricist view that all knowledge derives from experience, or the rationalist view that much of our most important knowledge is prior to experience, that is, innate?
Here are other
important distinctions.
Today I'll quickly discuss
several of those in green.
Now, these are all distinctions
pertaining to cognition itself.
There are also important distinctions
about how cognitive science
should be practiced.
These are issues in the
philosophy of science,
including these eight
or so issues here.
Now the plan for these lectures is to
take up these issues in sequence.
As you can see, as always,
I've packed in more material than
anyone can absorb in one hour.
You may want to tune
in and out a bit to
conserve attention for the
parts that most interest you.
We'll start now with perhaps the
most fundamental issue. This one.
What is cognitive
science the science of?
Brain dynamics or behavior
or mental knowledge?
The East Pole position
here is emphatic.
Even though it is brain dynamics
and behavior that are measurable,
cognitive science should be
the study of mental knowledge,
which is not directly observable
through sensory perception.
To understand this somewhat
paradoxical position,
we need to view it from the
perspective of rationalism.
Plato emphasized that perception
cannot provide knowledge.
Plato's scorn for
experience is a preview of
Chomsky's rejection of corpus
data for studying language.
Why can perception not
provide knowledge?
Well, consider that having exactly equal length is a concept we know, yet we can never perceive it with our senses. More generally, the most certain knowledge we have is about what we can't perceive: mathematics.
In his famous Parable of the Cave,
Plato asks us to imagine a
group of prisoners chained for
life so that all they've ever
seen is the wall of the
cave in front of them,
on which they see shadows cast by the fire behind them.
They believe the shadows are the reality of their world, although we know they're merely images projected by the real objects.
Plato says that our own perception of
the world is likewise not reality,
but merely the shadows
cast upon us by
the reality that he calls
the world of ideas.
Since knowledge cannot arise from perception, it has to come from somewhere else, and Plato posits that it comes from recollection from past lives.
Now this is pretty untenable,
but it does anticipate Chomsky's view
that our deepest knowledge
of language is innate,
not learned from experience.
Plato also believed that the
true student of astronomy is
not to be troubled too much about
the actual heavenly bodies,
but with the mathematics of the
motion of ideal heavenly bodies.
This anticipates Chomsky's notion of linguistic competence, which cannot be directly observed in language behavior.
The preeminent rationalist of the last 2,000 years was Rene Descartes, whom philosophers call the founder of modern philosophy. He argued that the only thing whose existence we can be certain of is our own mind.
All else can be doubted.
Perhaps, what we think is real
outside our heads, is just a dream,
or a perception deceived by an evil demon, as in The Matrix.
The central pillar of
rationalism is that,
what's outside our heads is
a poor cousin of what's
inside our heads.
For our second key dimension
of theory space, we ask,
should cognitive
science seek theories
that explain the details of
actual human performance,
or theories of competence,
the knowledge underlying performance,
ignoring the vagaries of
particular instances of
use of this knowledge?
As we already saw anticipated in this remark of Plato's,
the East Pole position
is that we need to
find out what the underlying
knowledge is first.
Only then can we study how
the knowledge is acquired by
learning algorithms and used
by processing algorithms.
Now in the case of language, this was of course Chomsky's position. How does it actually play out for Chomsky?
Well, Chomsky says that knowledge of
language competence consists
primarily of two things.
First, the mind's ability to
construct representations of
linguistic expressions
like sentences or words,
symbol structures like these.
I hope you'll forgive
my laziness in using
my whiteboard drawings from class.
On the left, is the parse
tree of the noun phrase,
"The man who I think that
you said that you had seen."
The arrows show movements of
symbols between positions.
On the right, is the mental
representations of the noun phrase,
"The man who I wonder
which scone preferred."
The second component of competence
is knowing principles that
determine which structures
are well-formed.
These principles define the grammar.
For example, one such
principle is subjacency,
which says that each step
of movement may not cross
more than one bounding node.
The set of bounding nodes is a
parameter in this principle.
Here's how this parameter explains
cross-linguistic differences.
As you would have noticed
the structure on the right,
"The man who I wonder which scone
preferred," is not
acceptable English,
marked conventionally by a star.
This is because on
the Chomskyan view,
the set of bounding nodes in English
includes the S sentence node,
and this step of movement
here crosses one,
and then a second S node,
violating the principle.
Now astonishingly,
the Italian version of the
same phrase is acceptable.
Why? Because the parameter has
a different setting in Italian.
S is not a bounding node,
in Italian, S-prime is.
This is the only S-prime node here,
that's crossed by this movement.
It does not violate the principle in
Italian the way it does in English.
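As a toy illustration of how a parameterized principle like subjacency can be checked mechanically, here is a sketch; the encoding of the movement step and the bounding-node sets are simplified assumptions for illustration, not a serious grammar.

```python
# Toy sketch of checking subjacency: a movement step is licit if it crosses
# at most one bounding node. The set of bounding nodes is the parameter
# that varies across languages.

def subjacent(crossed, bounding):
    """True if a movement step crossing these nodes obeys subjacency."""
    return sum(1 for node in crossed if node in bounding) <= 1

# Hypothetical encoding of the offending movement step in "the man who
# I wonder which scone preferred": it crosses two S nodes and one S' node.
step = ["S", "S'", "S"]

english = {"S", "NP"}    # assumed bounding nodes for English
italian = {"S'", "NP"}   # assumed bounding nodes for Italian

print(subjacent(step, english))  # False: two S nodes crossed -> ill-formed
print(subjacent(step, italian))  # True: only one S' crossed -> well-formed
```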
On Chomsky's view, knowledge of principles is innate.
These principles structure how
children analyze their
linguistic input.
They know they need to
observe which nodes can
be crossed by movement
in their language,
so they can learn the
value of the parameter,
the set of bounding nodes for
their particular language.
We'll come back to this in a moment when discussing Immanuel Kant's synthesis of nativism and empiricism.
As we just saw, for Chomsky, the data of language are expressions labeled as "well-formed" or "ill-formed". The task of the linguist, the generative grammarian, is to logically derive
is to logically derive
the labels of expressions
from the principles.
These are theorems that
require symbolic proof.
What is the state of the art
on this task? You might ask.
Well, it's provided by theories
of theoretical linguistics,
and models from classic, symbolic
computational linguistics.
Not by neural models,
whether they be NLP or
connectionist cognitive models.
These models do not
perform this task.
They do not symbolically
deduce the data
from principles or
axioms of a theory.
They are categorically
banned from the leaderboard.
Could neural nets possibly
have innate knowledge of
principles of universal
grammar such as subjacency?
Well, as it happens at the Cognitive
Science Society Conference
this week, Tom McCoy, currently an MSR summer intern and the curator of questions during this very talk, will show how this can be done through meta-learning.
Now, we're ready for
another big dimension
of cognitive theory space.
Is the mind primarily a device for
processing highly structured
symbolic information,
or for performing statistical
analysis of experience?
As we just saw in Chomsky's work,
the East Pole position
is that the mind is
principally a symbolic
structured processor.
For this issue too, the East Pole roots lie in antiquity.
Aristotle was a founder of
deductive logic in the West.
One of his syllogisms
asserts, for example,
that if we know that some As are Bs,
and that all Bs are Cs,
then we can deduce
that some As are Cs.
The point, is that the
content of A, B, and C,
what they mean is irrelevant,
only the structure of
this information matters.
This is pure, meaning-free symbol manipulation, 2,000 years ago.
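One way to see that only the structure matters: the deduction can be coded with no reference to what the symbols mean. A minimal sketch:

```python
# Aristotle's syllogism as pure symbol manipulation: the rule fires on the
# *form* of the premises; what A, B, and C mean is irrelevant.

def syllogism(premise1, premise2):
    """From ('some', A, B) and ('all', B, C), deduce ('some', A, C)."""
    (q1, a, b1), (q2, b2, c) = premise1, premise2
    if q1 == "some" and q2 == "all" and b1 == b2:
        return ("some", a, c)
    return None

# The content of the symbols makes no difference to the deduction:
print(syllogism(("some", "humans", "philosophers"),
                ("all", "philosophers", "mortals")))
# -> ('some', 'humans', 'mortals')
print(syllogism(("some", "X", "Y"), ("all", "Y", "Z")))
# -> ('some', 'X', 'Z')
```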
Even before Aristotle, in India, Panini used pure symbol manipulation to
do the task of generative grammar
that we saw on the previous slide.
Opposed to the East Pole are the West Coast approaches that take statistical analysis as the primary computation in cognition.
The historical roots
of this view date
back to the British empiricists,
as I've already mentioned.
The empiricist emphasis on
statistical computation derives from
their position on the
origin of knowledge.
Whereas the rationalists focus on knowledge generated internally, empiricists, including many West Coast approaches, take the view that all knowledge is extracted from the external world through experience.
John Locke maintained that
all knowledge starts from
perception and is built up by
associating together
sensations that are similar
in their content or in
their time of perception.
The empiricists did an amazing PR job and convinced the entire English-speaking world of their view.
Now in our community,
there's virtually
universal acceptance of
the view that our minds
start as blank slates.
We see it asserted all the time in AI, stated with absolutely no evidence, as if it's obviously true, just common sense.
But there are grave problems with this, actually, and here are just two. First, consider the belief that our perceptions are caused by events outside our heads. We can't perceive outside events causing our perceptions, so we can't know that they do, if all knowledge derives from perception. We're back to Descartes and the problem of The Matrix.
Second, how can we justify the belief that past correlations predict future ones? What good is it to know what past correlations are if they don't inform us about the future? The only evidence we have is that in the past, past correlations predicted what were then future ones. But that's all in the past. We can't justify believing that what is now our future will be statistically like the past.
Of course, inference to future
predictions is uncertain,
but there's simply no basis
for even that form of
inference that can be extracted from experience alone.
These are unsolvable
problems for empiricism.
In fact, the great empiricist David Hume himself just gave up on causation, concluding that there is only correlation, and our minds just have a habit of thinking that A causes B when the correlations between them are right.
Now as we've seen, the East Pole position counters that much crucial knowledge is not extracted from experience.
Chomskyan innate knowledge of linguistic principles
was anticipated by Plato's conclusion
that knowledge must
arise from recollection,
since it can't arise from experience.
Aristotle's syllogisms enable new beliefs to be rationally generated internally to the mind, no external input needed, although this form of knowledge is purely tautological, essentially true by definition.
Immanuel Kant provided a synthesis of empiricism and rationalism, which I think comes closest to the right view in cognitive science today: that we are born with some important innate knowledge of how to impose organization on our experience, so that we can best learn from it.
For Kant, this included
innate knowledge of a
four-dimensional Euclidean world,
which allows us to organize our
perception of objects and events.
Innate knowledge that causal relations exist among external events.
Innate knowledge that our perceptions
are caused by external events.
From Chomsky, recall how innate knowledge of the subjacency principle allows children to
organize their linguistic experience
so as to be able to learn the
specific parameter values,
characteristic of their
particular language.
This is very much the way Kant saw
innate knowledge interacting
with experience.
The study of innate knowledge is
now highly experimental actually.
The field of cognitive
development was revolutionized in
the 1960s by methods for inferring
knowledge inside infants' heads,
by measuring the time
they spent looking
at different types of events.
Four to six month-old
babies distinguish between
event relations that we consider
causal and those we consider merely correlational.
They respond differently
if one object is
set in motion upon
collision with another.
As you can see with my hands, they respond differently to events like this than when an object is set into motion by another object that comes close and stops without making contact, like this.
The correlational structure
is exactly the same.
But one we regard as
causal and the other not,
and infants make the same
distinction among events,
at least by the age of four months.
These methods have
also revolutionized
the study of animal cognition,
and there the study of innate
knowledge is informative
about knowledge that evolution
might have provided humans.
Liz Spelke summarized her extremely influential theory of core knowledge systems this way.
''This research suggests
that the human mind is not
a single general purpose
device that adapts itself to
whatever structures and challenges
the environment affords.
Humans learn some things readily,
and others with greater difficulty,
by exercising more
specific cognitive systems
with signature properties and limits.
The human mind also
does not appear to be
a 'massively modular' collection of
hundreds or thousands of
special-purpose cognitive devices.
Rather, the mind
appears to be built on
a small number of core systems.
These systems are knowledge
of inanimate objects and their
mechanical interactions.
Knowledge of animate agents and
their goal-directed actions,
knowledge about sets and
the numerical relationships of
ordering, addition, subtraction,
knowledge about places in
the spatial layout and their
geometric relationships,
and finally, knowledge for
identifying and reasoning
about potential social partners
and social group members."
No matter how you look at it,
this work presents a
major challenge to
contemporary AI based on deep
learning and neural networks.
Either such abstract knowledge systems are innate, or they are learnable, unsupervised, from the infinitesimal quantity of input that infants receive compared to the gargantuan training sets that are now required by deep learning.
What AI needs to do, argues Gary Marcus, is to abandon purist empiricism and study the proper use of innate knowledge, like that which makes the highly data-efficient learning of infants possible.
On the other side, there's
an important challenge too,
a challenge to the East pole view
that comes from a surprising place.
We return to our first issue,
"What is cognitive
science the science of?"
and now consider the
second option, behavior.
Behaviorism is a West Coast
perspective that dominated
experimental psychology in the US
for the first half
of the 20th century.
Strangely, this has in fact made a comeback, with a strong presence in contemporary neuroscience.
B.F. Skinner argued that
psychology should be
the study of behavior,
and that we should explain
behavior from the history of
positive and negative reinforcement
experienced by the subject.
On the other side, nearly all of
modern cognitive
science says behavior
should be explained by inner causes,
states of the mind,
beliefs, goals, and so on.
But Skinner argued
that explanation by
inner causes has three fatal flaws.
First, it is unnecessary.
If inner states are
lawfully determined by
reinforcement history and behavior
is lawfully determined
by inner states,
then behavior is lawfully
determined by reinforcement alone.
We can just cut out the
middleman, inner causes.
Second, explanation by inner causes is unfalsifiable, fatal for any science, because inner causes are not observable.
Third, explanation by inner causes is circular.
How does it explain why
a horse is drinking?
Because of the inner cause of thirst.
How do we know the horse is thirsty?
Because it is drinking: completely useless circular reasoning.
Now I think these are
serious accusations to
which cognitive science must
have very secure defenses.
My own, in a nutshell,
goes like this.
Hypothesizing inner causes is necessary, because steps 1 and 2 of Skinner's argument are intractably complex; and it is valid if the number of hypothesized causes is small relative to the number of observable predictions that they make.
This can be formalized, I think, in computational learning theory's version of Ockham's razor: when the dimension of the hypothesis space is much smaller than the number of observations, the probability of correct generalization to unseen examples can be good.
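For reference, here is one generic statement of that learning-theory result, a VC-style generalization bound of my own choosing rather than a specific formalization from the talk. With probability at least $1-\delta$, every hypothesis $h$ in a class $\mathcal{H}$ of VC dimension $d$ fit to $m$ observations satisfies

$$\mathrm{err}(h) \;\le\; \widehat{\mathrm{err}}(h) \;+\; O\!\left(\sqrt{\frac{d\,\ln(m/d)+\ln(1/\delta)}{m}}\right),$$

so when $d \ll m$, the gap between training performance and generalization to unseen examples is small.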
Now, Skinner's approach to explaining language, what he called verbal behavior, extended his philosophy from pigeons to people. Skinner wrote a book in 1957, Verbal Behavior, which was followed rapidly by a review written by Chomsky, in which he granted that behaviorism had proved successful for explaining simple behaviors of simple organisms.
But for human language, what Skinner was doing was just play-acting at science: behaviorist descriptions are merely vague paraphrases of precise, formalizable, mentalist descriptions of inner causes.
Publication of this review was
one of the events that launched Chomsky to the status of a superstar and,
in fact, to many, even a god.
Chomsky's sarcastic,
condescending tone in
this review would characterize
his style throughout his career.
For this reason, within
much of cognitive science,
reinforcement learning does
not have a good pedigree.
Now we shall take up an
issue that is central to
cognitive science across the board.
Should a cognitive theory
be a computational theory,
viewing the mind as a
machine that takes data in,
processes it according to the
algorithm, and outputs behavior?
On this issue,
nearly all cognitive scientists
agree that the answer is yes,
although we will see a bit of an exception in next week's discussion.
One of the earliest modern
expressions of this view
was Alan Turing's classic 1950 paper.
At the end of the paper, like many
AI writers that followed him,
he described an approach to
AI based on learning from a blank slate, simply asserting that this is how children learn.
Instead of trying to produce a
program to simulate the adult mind,
why not rather try to produce
one which simulates the child's?
Presumably, the child brain is something like a notebook as one buys it from the stationers: rather little mechanism and lots of blank sheets.
He then, again with no justification,
asserts that learning in children
is reinforcement learning.
We normally associate punishments and
rewards with the teaching process.
But putting this aside,
the important point of the paper for
our story is really
about something else.
The paper is really about the
question, can machines think?
He asserts that the question is
too meaningless to take seriously.
So he proposes to replace it with what he calls the imitation game,
which has become what we now
know as the Turing Test.
Turing also began the
tradition of absurdly naive,
overoptimistic predictions about AI.
"I believe in about 50 years' time
it will be possible to program
computers with a storage capacity of
one megabyte to make them play
the imitation game so well
that an average interrogator
will not have more than 70
percent chance of making
the right identification after
five minutes of questioning."
The Turing test has, of course,
been enormously influential
and widely accepted within AI.
But from the perspective
of cognitive science,
it is seriously defective.
It is entirely behaviorist, appropriate for 1950. It reduces intelligence to behavior, ignoring entirely what goes on inside the system.
It replaces the scientific
search for a theory of
intelligence with the judgment
of an uninformed, naive human observer: does this system's behavior seem intelligent to you?
Here's an analogy from physics.
Suppose we said, the
question, "Is an object hot?"
is too meaningless to take seriously.
We should put the
object in a closed box,
let an observer reach in
and touch the object,
and ask the observer, "Does
this seem hot to you?"
Of course, that's ridiculous.
We'd never do that for a
physical science the way
Turing proposes to do it
for a science of the mind.
Physics gives a theory of
temperature with techniques for
quantifying the hotness of an object,
laws that govern
temperature, and ultimately,
reducing being hot to a deep
internal property of the object,
the average kinetic energy of
the molecules that make it up.
So Turing essentially dismisses entirely the possibility of a science of intelligence.
This is absolutely not the case for
computer scientists who followed him.
The founders of AI
took their job to be,
to characterize the internal
computational properties
that make a system intelligent.
We'll first take up the work of John McCarthy, and then look at the work of Newell and Simon.
These authors occupy a central
place at the East pole.
But before we discuss these
East pole AI pioneers,
let me point out that
contemporary deep learning
is a return to Turing's
West Coast Behaviorism.
Turing's behaviorist stance, refusing to look inside the system, is evident in contemporary AI because of the difficulty of looking inside neural networks.
Concepts from cognitive science about system-internal properties are now, in AI, routinely converted to behaviorist concepts that display the weaknesses that Chomsky pointed out in Skinner's attempt at behaviorist linguistics.
So just as Turing replaces intelligence with behavior that fools people, in deep learning, the term rationale, which properly refers to a rationalist logical derivation of a system's response from its internal knowledge, is coerced to mean input that affects the output: input in the stimulus or input from the training data.
Awareness is a hypothesized internal property of cognitive systems, now scientifically well-studied in psychology and neuroscience. It has been coerced in deep learning so that awareness of X, sometimes called being X-aware, just means that the behavior of the system is conditioned on X. This is entirely behaviorist; nothing could be further from what we actually know about awareness.
But now, let's return
to the founding of
AI and consider the early work of John McCarthy,
who was credited with giving
the field its name in 1956.
Unlike Turing, McCarthy's
mission was all about
looking inside intelligence to see
what computations it requires.
An issue separating many
West Coast theories from
the East pole is whether
inference is a kind of logic.
McCarthy's East Pole position was certainly yes.
One of the earliest AI papers
was McCarthy's 1959 paper,
Programs with Common Sense,
in which he proposed a system
he called the advice-taker.
It gave an analysis of
the capabilities required
of an intelligent agent,
which were to acquire new
knowledge by inference,
process abstractions, and
to learn by being told.
His solution was that
knowledge representation
must be language-like,
must be structured
symbolic expressions.
This enables abstract encoding in symbols, deducing immediate consequences in logic, which he called common sense, assembling such inferences into long chains, and learning by being told.
To illustrate, here's one of
the premises he proposed
for common sense deduction.
What this formula means is this: if the system knows that X is true, knows that whenever X is true it can ultimately make W true by doing Y and then doing Z, and it wants to make W true, then it can infer that it should do Y.
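Rendered as a logical formula, the premise just described is roughly this; the predicate names are my own placeholders, not McCarthy's exact 1959 notation:

$$X \;\wedge\; \big(X \rightarrow \mathit{achieves}(\mathit{do}(Y);\,\mathit{do}(Z),\; W)\big) \;\wedge\; \mathit{want}(W) \;\rightarrow\; \mathit{should}(\mathit{do}(Y))$$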
McCarthy's work was
enormously influential.
In AI, the logic school flourished for decades, filling up the journal Artificial Intelligence with a specialized logic for every task.
In cognitive science, not so much.
As we'll see in next week's
episode in cognitive science,
connectionist and probabilistic
inference dominated.
Along with John McCarthy, Alan Newell and Herb Simon were founders of AI in the 1950s, and like McCarthy, their approach took East Pole positions on these issues.
The cognitive processes they studied
were conscious
problem-solving processes,
using explicit representations
of facts and rules,
and explicitly following those
rules as interpreted programs.
This processing was primarily serial.
Together, these
properties characterize
what you may know as System 2,
which Yoshua Bengio has recently
been talking quite a bit about,
citing psychologist Danny Kahneman, who actually took these terms from Stanovich and West.
System 1 has the opposite values on these three dimensions,
and is characteristic of
predominantly intuitive,
automatic processes like most
visual and language processing.
Newell and Simon's particular East Pole position evolved into building production-system-based, general-purpose cognitive architectures.
Their huge tome, Human Problem Solving, was a landmark in the field, summarizing two decades of work.
At this time, AI heavily overlapped with cognitive science.
Their work was based on extensive experimental data: recorded eye movements and written transcripts of people thinking aloud as they solved problems in symbolic logic, chess, and cryptarithmetic, which has problems like this, where you have to figure out which digit each letter stands for in order for the result to be a valid addition of six-digit numbers.
Their model of human
problem-solving was
heuristic search in problem spaces.
Problem spaces were graph structures in which the nodes were knowledge states consisting of about 20 symbols, with operators performing transitions between nodes about every five seconds.
These operators were
production rules.
Like McCarthy's rule that we saw, the left side is a condition, a symbolic pattern containing variables, and the right side is an action, executed when the current problem state matches the condition.
The knowledge state nodes were simple structures in which each symbol was a pointer to a chunk of symbolic structure stored in memory.
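To make the production-rule idea concrete, here is a minimal sketch of one condition-action rule firing against a symbolic knowledge state. The pattern matcher and the cryptarithmetic-flavored rule are invented for illustration, not Newell and Simon's actual productions.

```python
# Minimal production-system step: the left side of a rule is a symbolic
# pattern with variables; the right side is an action on the knowledge state.

def match(pattern, fact):
    """Match a pattern like ('assigned', '?letter', '?digit') against a
    fact, returning a variable binding dict or None."""
    if len(pattern) != len(fact):
        return None
    binding = {}
    for p, f in zip(pattern, fact):
        if isinstance(p, str) and p.startswith("?"):
            if binding.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return binding

def fire_one(rules, state):
    """Serial processing: fire the first production whose condition
    matches some fact in the knowledge state."""
    for condition, action in rules:
        for fact in state:
            b = match(condition, fact)
            if b is not None:
                return action(b, state)
    return state

# Toy cryptarithmetic production: once D is assigned 5, the column D + D
# yields T = 0 with a carry of 1 (5 + 5 = 10).
def conclude_T(b, state):
    if b["?letter"] == "D" and b["?digit"] == 5:
        return state | {("assigned", "T", 0), ("carry", "column1", 1)}
    return state

rules = [(("assigned", "?letter", "?digit"), conclude_T)]
state = {("assigned", "D", 5)}
print(fire_one(rules, state))
# -> now includes ('assigned', 'T', 0) and ('carry', 'column1', 1)
```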
There were three
separate memory systems;
long-term memory, short-term memory,
and external memory, such as paper or a chess board.
Newell and Simon used their
human problem-solving data
to infer the characteristics
of these memory systems.
Long-term memory's limitation was a slow write time of five seconds per symbol [inaudible].
Short-term memory's limitation was a very small capacity,
about seven symbols,
while external memory also
had limited write time.
You can see they were extremely serious about modeling the human mind as a computer.
This computer is described
here at three levels,
which the computer vision pioneer,
David Marr, would become
very well known for.
He emphasized the
importance of describing
computational systems
at three levels,
especially the highest
level, which he,
quite confusingly in my opinion,
called the computational level.
This corresponds here
to the knowledge level,
which focuses on precisely
characterizing computational
problems the system is solving.
In subsequent years,
Newell and colleagues
evolved the Soar
cognitive architecture,
which is still used in papers
presented at the Cognitive
Science Society,
and it's been applied to a very
wide range of cognitive tasks.
Following in this CMU
tradition is one of
the most eminent cognitive scientists
of all time, John Anderson,
whose production system architectures
have gone by the name ACT,
[inaudible] control before ACT.
The 2005 version of
Anderson's ACT architecture
consisted of these six modules.
In addition to the behavioral
psychology evidence supporting ACT,
Anderson was looking
for neural evidence
using functional magnetic
resonance imaging, fMRI.
The linking hypothesis between
the computational model and
neural activation was this.
The time that each module is active in solving a particular problem predicts the fMRI response, the amount of neural activation, in six brain regions, as a function of problem difficulty and the subject's amount of practice.
Here are some results.
Each of the six modules of the ACT model was associated with a brain region; for example, the control state module is associated with the anterior cingulate cortex.
In these plots, dots are measurements and curves are the model's predictions.
I don't know about you, but given the limited number of free parameters available for the fit of the model to the data, I think this is damn impressive.
That's it for today.
Next week's episode, we'll cover
several more key dimensions of
cognitive science space and
the seminal work of several
other prominent authors.
This will include a
lot of [inaudible].
So to briefly summarize
today's episode,
a fundamental question is what
kind of computer is cognition?
Symbolic? Neural? There are
many dimensions or
parameters along which
proposed computational
models of cognition differ.
A coherent set of
parameter values based in
symbolic computation
characterizes the East Pole view.
A wide assortment of other values characterizes the different West Coast views.
The East Pole view inherits from the rationalist tradition, in which knowledge derived from experience is regarded as secondary to knowledge that is innate or derived by internal reasoning.
Pure East Pole linguistics
focuses on theories of
knowledge, that is,
competence, rather than on
algorithms for acquiring
or using this knowledge.
Elsewhere, AI pioneers
developed theories of
the computational mechanisms for
learning and using knowledge,
especially of conscious, serial, rule-following cognitive processes.
We also talked about
problems with some widely
accepted positions,
including these three.
All knowledge is derived
from experience.
Behaviorism and the Turing test.
Explanation by inner causes.
These problems or challenges come, some of them from the East and some of them from the West.
Episode 2, as I said,
will feature more West Coast work.
Thank you very much for
your long attention and I'm happy
to entertain your questions.
>> That was great.
>> Thank you.
>> If anyone has a question,
can you raise your hand?
Click the little hand icon
so that we can call on you.
>> Excuse me, sorry.
I've also asked Tom to
keep an eye on the questions
in the chat and curate them.
So I also want to give
him a chance to pass on
any questions from the chat that he
thinks would be particularly
useful for people to hear about.
>> Okay.
>> Yeah. There weren't any
questions in the chat,
so it's open forum.
>> Okay.
>> Just raising
questions people have.
>> Are you going to
moderate this, Matt?
How is this going to work?
I see [inaudible] hand is up.
>> Really? It doesn't
show that for me.
Okay. [inaudible] do you want to
go ahead and ask your question?
Paul, what are your thoughts about this? There seems to be a little bit of a divisive nature in these types of thought.
Of course, maybe this comes
from rhetorical reasons,
but it has reflected itself also
in our community in the sense that
perhaps there is no single
answer to these questions.
Perhaps humans, when they know
how exactly to solve a problem,
they have their rules and
the algorithm is clear.
They follow that algorithm.
When they don't, and they
have maybe some data,
they try to learn from data.
When they don't even have the data, or the memory capacity to store and process that data, they maybe fall back on their biases. Do you see any attempts at unifying these approaches?
>> Yes. Because of the
limited amount of time,
I am not able to take which
accounts that deal with multiple of
these dimensions and assign to
various aspects of a problem
the values on these dimensions
that are appropriate.
So lots of people do the kind of thing that you're talking about, asking, for some particular kind of phenomenon, how much knowledge is internally derived, how much is from experience, and how much, if any, is innate.
So for a particular domain
of behavior or knowledge,
we have to mix all these dimensions
together in the right way.
So what I try to do in this class is just to bring to awareness what these dimensions are
so that when people are
doing these combinations,
people can see what's going on
better and see how one mixture
of these elements in
one theory might relate
to a different mixture
in a different theory.
It's certainly naive to think that in any case we look at, it's all A or all B in these dichotomies.
So I entirely endorse
what you're saying,
but I also think it's characteristic
of work in the field.
Does that respond to what
you're asking about?
>> [inaudible] longer
discussion, but yeah. Thanks.
>> Thanks, [inaudible] Tom,
there's a couple
questions in the chat,
did you want to do this?
>> Yeah. Think there were people
with their hands raised first.
Maybe we should start keep
going down that list.
>> Yeah. [inaudible] for some reason,
I only see [inaudible] hand
raised now, so I don't know.
>> Okay. [inaudible]
>> Maybe you could [inaudible]
>> Yeah. Hi, Paul. Can you hear me?
>> Hi. I can. Yes.
>> Yeah. One of my questions
was when we talk about
innate knowledge,
isn't it even for a child,
isn't it also acquired by
experience in a different
generation, maybe?
Even the innate knowledge
is a result of experiences,
but not by the same body that
we are trying to analyze,
but it is coming from
a different body
that has gone through a
different experience,
and that's what is the
acquired knowledge.
So it's still experiences but just
passed on as an initial parameter.
So everything is an
experience in the end,
what do you think about that?
>> Well, I think that
evolutionary psychology
tries to connect various aspects of,
for example, the biases
that [inaudible] mentioned,
as well as the innate properties
that we have some evidence for.
Evolutionary psychology
tries to relate those
specifically to their function,
and of course, of the
evolution of the species.
You can think of that as a learning process; it's like a genetic algorithm at work, learning what then becomes the initial state for later organisms.
You can call that
learning from experience,
I think that's a perfectly
reasonable thing to do.
Of course, a major
part of it is death.
So you could call that an experience,
not one we definitely
are looking forward to.
But of course, evolution is all about
the characteristics of organisms
that allow them to
survive and reproduce.
So that is a kind of
experience if you like.
But in any event, yes,
I think that it's not
uncommon to think about
innate knowledge as the
result of some kind of
learning process in
prior generations.
>> Cool. Okay. I had
a second thought,
which is, do we also think of it from a goal perspective?
If we have a specific goal of
identifying a person's face,
for example, in a video.
Now from that goal's perspective,
we can decide whether innate
knowledge or symbolism is important,
or just going through
1,000 videos to detect
the face is enough to solve
that specific problem?
So goal-oriented view of
analyzing these approaches,
do you have any thoughts around that?
>> Well, I guess,
to the extent that what
you're referring to as goal
here defines a particular task.
>> Yeah.
>> Then certainly, there's
a lot of discussion of
the relationship with the different
mechanisms and the tasks that
they are crucial for solving.
So absolutely, the role of inductive learning from lots of data in certain kinds of tasks, maybe face recognition, is certainly discussed, but also,
there's interest in what kind
of innate characteristics of
our perceptual system
may be at work in how
we're actually processing those
faces as we experience them.
>> Okay.
>> Thank you.
>> Thank you.
>> Then I think Gonzalo was next,
then we can move to the
questions in the chat.
>> Sure. Paul, such a great lecture.
Thank you. My question maybe is about something that has always bothered me; maybe you can provide some insight. It seems that people learn faster than machines. We have these huge machines as big as buildings, but they still take several orders of magnitude longer to learn than people. Paul, why do you think that is?
>> Well, I think the story
that's presented from
developmental psychology,
which is showing how
early infants have
knowledge of how to organize
social events and physical events,
suggests that there are
innate organizing principles,
which means that when
the data comes in,
it's prepared in a certain way
that makes learning
much more efficient.
That's the reason that innate
knowledge was first proposed by
Chomsky to solve the problem of
how with such a small amount of data,
a child can learn
such abstract and rich
knowledge systems or grammars,
so that's the conventional story.
Does that speak to what
you're asking about,
or did I miss the question?
>> It is an answer, and thank you.
>> I don't want to imply
that I endorse the view
that innate knowledge is
the only part or
even the main part of the
answer to your question.
But we still certainly know very little about the space of possible unsupervised learning algorithms, and we may just be in the wrong part of the space.
>> Sometimes I wonder if it
is the architecture itself.
If we modeled computers as
the models of the brain,
but it still learns so
differently than the brain,
is that the architecture that
are modeled to make them?
I have more questions
than really the answers.
But really a lot of
thought to think about.
>> Absolutely. Thank you.
>> Moving to the chat and I
guess going based on likes for
questions Ben Vedome asked,
as models like gbt3
merged towards being
well beyond the
footprint of what we'd
consider as human
bounded rationality,
both in memory, computational cycles,
and access to sensory
information, i.e the Internet.
What is something key
about understanding
human cognition that might
inform AI development?
I'm sure he'd accept multiple
things instead of just one thing.
Ben, feel free to type and if
you wanted to add things
or follow up on that.
>> Did you want anything to that Ben?
>> No. That's so good,
as always Tom does an excellent job.
>> I guess we could ask
the question about whether
the form of the knowledge
which gbt3 players is a form
that allows for maximal flexibility
of use and generality of use.
Obviously, people are looking into
this all the time and trying
to see to what extent it
can quickly or even with
zero-shot experience do new tasks that it hadn't been directly trained on.
Of course, that's part
of the research program,
but I think the strong abstractive power and generalization power of rules with variables and symbolic structures, in supporting a wide range of general-purpose problem-solving in computation, makes me wonder whether the knowledge that GPT-3 has will make it able to display the combinatorially rich capabilities that symbolic forms of knowledge provide.
That's just a question not an
answer. What do you think, Ben?
>> I'm not sure, but it's
a good question rather than
answer but I said that a succinct
version of the question is,
how much is our intelligence?
Especially the things you spoke to
today are governed by the fact that
we are bounded computational devices,
and as we get the ability to
build increasingly
infinite computation.
By how much will the study
of intelligence between AI
and humans actually diverge?
Or will there just never be enough compute, so that the ability to generalize will continue to be essential for both machines and humans?
>> I think you have a very
good point that the study of
intelligence in GPT-3 versus in humans could very well take rather different forms, given the very different constraints under which those different computations operate, as you say.
I think that's true, and I think that people at Microsoft and elsewhere are always interested in the question of how to build collaborative systems in which what is limiting in one system is a strength of the other,
so that the fact that they're
different is actually
a strength and not
necessarily a limitation.
In fact, sometimes my view of the current spurt of progress in AI is that the main relation it has to cognitive science is somewhat opposite from what you were asking for. It's not so much what cognitive science can teach us about GPT-3; it's what GPT-3 can teach us about cognition. Until now, we've had only natural organisms to study the cognition of: we've had humans, adults and babies; we've had brain-damaged and intact individuals.
We've had animals, all different cognizers to study. But now we have a very different cognizer to study as cognitive scientists.
As you say, there's every
reason to think it can be
quite different from
theories that are
appropriate for the systems
that are working under
the constraints of natural cognition.
I think that's a super exciting
prospect for cognitive science.
As you know from the work that you do, a lot of people are very excited about doing psycholinguistics on BERT: taking what we've learned in cognitive science about the questions to ask, and the methods used to answer them, and applying them to a very different cognizer.
That's another important connection between modern AI, which actually gives us some intelligence to work with for the first time, and cognitive science; I think it's every bit as exciting as the other direction, in which what we learn about human cognition might inform what to do in improving GPT-3.
>> Thank you, that was very well put.
>> Thanks Ben.
>> Next. Andy Gores asks,
how is new technology for scanning
working brains more detailed,
the anatomy of the brain,
and experiments, and
growing brain cells,
and watching their behavior,
how are those things influencing
theories of cognition?
Or are they influencing
theories of cognition?
>> Well, a point in John
Anderson's 2005 paper was,
if I'm not mistaken,
that in order to match well,
the proposed architecture of
his cognitive model and
observations in the brain,
and this was certainly true,
I know for certain that this was
true about trying to connect
the predictions from his
computational system
and human behavioral experiments.
That the models had to be modified
in the decomposition of the
computation into modules,
had to be reformulated in light
of what was learned from looking
at behavioral data in
problem-solving and learning
behavior in the laboratory.
But also, in terms of the neural imaging results, I do think that to the extent that that work continues, we are in the business of learning about which deep models of visual processing best fit what we're seeing in neural activation in the human visual system, or, more often I think, animal visual systems.
I think that we definitely
are seeing feedback
into the modeling enterprise
from the neural data,
and I'm sure that that will
continue to be more
and more important.
It has already been quite important in the study of computational models of consciousness; maybe calling them computational is stretching it a little bit, but somewhat-computational models of consciousness have been very heavily shaped by what has been observed about the interactions of neural components in different states of awareness.
I think that it's been rapidly
growing in importance,
that flow of information
from neural data
into cognitive theory development.
Did that address the question you asked?
>> Yes, thanks.
>> Okay, yes.
>> Great. Next was
Jenny Chan who asked,
what are your thoughts on
spiking neural networks?
Do you believe they
will generalize well
across multiple types
and dimensions of data?
Or are there only certain types of
tasks that they'd be
advantageous for?
>> Well, I'm not
as much in experience making
neural networks as they wish.
So my answer is
not as informed as I would
like, but I would say that,
the use of spikes as a kind of
implementation level feature of
neural networks has not struck me
yet as fundamentally changing
the more high-level descriptions
of the computations
that are going on.
I would want to see evidence
of that before concluding
that whether neuron spike or
not is more than a low level
implementational detail,
but something that really
profoundly affects
the computation at algorithmic
and higher levels of description.
Maybe you know a reason to believe
that about spiking models,
and if so, then I would like to hear
about it and I would
think that's important.
>> Yeah. Thank you so much.
I like what you said about how spiking neural networks would need to be a larger architectural change. Because from what I understand, spiking networks are just swapping out the plain neuron for a different function, so it's not really changing the structure of the network as a whole.
>> Right. One thing I can
mention for my own work is that,
the work that was done using spiking
neurons as a way of binding
together variables and values,
for example, binding
together the agent
of some proposition with its role.
If you have a John and
Mary than you have John
bound to the receiver role.
There were models proposed in
which neurons encoding the
information about John and neurons
that were coordinated with
neurons that encoded information
about being a perceiver.
Their neural oscillations were
correlated or synchronized.
We showed that that use of time to do variable binding was just a particularly temporal implementation of more general mechanisms for doing variable binding that didn't necessarily involve time. It might or might not be a useful way of implementing a neural mechanism for binding variables to values, but our work suggests that it was not fundamentally different in the computational power that it offered.
But that's just one small bit
of relevant experience.
Thanks for the question.
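For the curious, here is a minimal sketch of the kind of non-temporal binding mechanism at issue: tensor product binding of fillers to roles by outer products. The toy random vectors are my own choices, not anything from the published models.

```python
# Tensor-product variable binding: a filler vector (John) is bound to a
# role vector (receiver) by their outer product, with no use of spike timing.
import numpy as np

rng = np.random.default_rng(0)
dim = 50
john, mary = rng.standard_normal((2, dim))       # filler vectors
agent, receiver = rng.standard_normal((2, dim))  # role vectors

# Bind each filler to its role; superpose the bindings into one tensor.
structure = np.outer(john, agent) + np.outer(mary, receiver)

# Unbind: multiplying the tensor by the dual of a role vector recovers its
# filler. The pseudo-inverse of the role matrix supplies the dual roles.
roles = np.stack([agent, receiver])   # 2 x dim
duals = np.linalg.pinv(roles)         # dim x 2; columns are dual role vectors
who_received = structure @ duals[:, 1]

print(np.corrcoef(who_received, mary)[0, 1])  # close to 1: mary recovered
print(np.corrcoef(who_received, john)[0, 1])  # close to 0
```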
>> Thank you.
>> Felix Faltings asks,
do you think System 1 versus
System 2 thinking follow a
different models of cognition?
Because it seems like neural
networks to more System 1 thinking,
while symbolic reasoning looks
more like System 2 thinking.
>> Well, I think that this
is a good context to remember.
Warning we got from
[inaudible] right off the bat,
which is basically that there are
parallel subconscious components and
conscious serial components to
very many different
cognitive processes.
They have to somehow work
together to get the job done.
In John Anderson's
Production System Models,
you fire one production at a time.
It's very sequential, just like Newell and Simon's production system models were, but which production fires is determined by a parallel process of spreading activation through a network.
So inside the whole
production firing process,
there are parallel
operations and those
are not accessible to consciousness.
There is always a mixture
of these different aspects.
I think that you could say that
certain cognitive processes
have more System 1 and others
have more System 2 to them.
But I think you have to look
at it pretty much that way.
I agree with what you said: the most natural connection is to use neural network processing, which is parallel and not naturally symbolic, to model the System 1 aspects of a cognitive process, and to tie that in somehow with using
symbolic structural computation for
more System 2 aspects
of the computation.
That's a huge challenge of
doing neuro symbolic
computation more generally.
But I think that we're going beyond merely saying, well, we can do certain things called System 1 tasks with neural networks, and other things we use symbolic methods to do with our System 2. We're beyond that, to figuring out how to interweave these, because any real task is going to involve a combination of the two.
>> Okay, and let's end on that note.
I think that's a great
place to end it.
We're out of time for more
questions, unfortunately,
but I don't know if Paul,
if you want to take a look
at the questions that were
there or maybe people
could just ask next week when Paul's back again to continue his presentation. Let's
thank Paul again.
>> Thank you.
>> Great. Thanks, Paul.
Thank you, everybody.
>> Thank, Pat.
