Dr. Bengio is a world-leading researcher in
artificial intelligence and a pioneer of deep
learning. He received the ACM A.M. Turing
Award, known as "the Nobel Prize of Computing,"
jointly with Geoffrey Hinton and Yann LeCun
in 2018. The award recognized their conceptual
and engineering breakthroughs that have made
deep neural networks a critical component
of computing. He is the founder and scientific
director of Mila, the Quebec Artificial
Intelligence Institute, and a co-founder of
the AI company Element AI. He has been a
professor in the Department of Computer Science
and Operational Research at the University
of Montreal since 1993, with more than 300
publications and over 80,000 citations.
Good morning, Yoshua. Hi. Thanks for joining
our conversation today. You've been working
in the deep learning area for many decades.
Could you share with us your journey, your
mission, and how it has evolved?
Right, so, you know, the relationship between
scientists, researchers, and ideas can be
a very emotional one. I've always been very
passionate about the research I do. But really,
I fell in love with what I call "the amazing
hypothesis": that there might be a few simple
principles that can explain our intelligence.
That was when I started reading neural network
papers, around 1985. It was a long time ago.
A long time ago.
Maybe the papers that impressed me the most
were those coming out of Geoffrey Hinton's
group. And I kind of knew that, yeah, this
was what I wanted to do, and that has continued
ever since. Initially, when I started in the
late 80s, very few people were doing it, but
it became something hot, and many people were
entering the field. Then I finished my PhD
in '91. But in the 90s, the interest in neural
networks gradually decreased as other machine
learning approaches took over.
So there was a long period where what really
kept me working on this was precisely this
emotional aspect: I really felt strongly about
these ideas. I also looked around and tried
to understand some of the limitations of neural
networks and of other methods like kernel
methods, which helped me understand more
mathematically why my intuitions were right.
And then, of course, in the last decade, things
have really exploded in successful applications
and benchmarks, and thanks to deep learning
the whole field of machine learning has become
something that's not just done in universities:
it has become a social thing, a huge business,
and it's changing society, potentially sometimes
in bad ways. And so we also have a responsibility
about it.
Yeah, thanks for that thought. Yesterday you
gave a great presentation on going "from system
one deep learning to system two deep learning,"
and I think the consciousness / attention
model is the core part of it. Could you share
more about your thoughts and your findings?
Yes, so it's interesting. The C word, consciousness,
has been a bit of a taboo in many scientific
communities. But in the last couple of decades,
neuroscientists and cognitive scientists have
made quite a bit of progress in starting to
pin down what consciousness is about. And
of course, there are different aspects of
it, and there are several interesting theories,
like the global workspace theory. Now I think
we are at a stage where machine learning,
especially deep learning, can start looking
into neural net architectures, objective
functions, and frameworks that can achieve
some of these functionalities. What's most
exciting for me is that these functionalities
may provide evolutionary advantages to humans,
and thus, if we understand them, they could
also be helpful for AI.
On the relationship between consciousness
and attention: is it fair to say that attention
finds the mapping from the high-dimensional
unconscious state to the low-dimensional
conscious state, to help with generalization?
Exactly, exactly, to help with generalization.
Yes, yes. And the interesting thing is that
this mechanism of selecting just a few variables
at a time corresponds, according to my theories,
to something you can think of as a regularizer:
an a priori assumption about the world, which
humans use in order to form the kind of high-level
concepts that we manipulate with language.
So, if I say "if I drop the ball, it's going
to fall on the ground," this sentence involves
very few concepts at a time. Attention has
selected just the right words, a few concepts,
and together they have a strong dependency.
So, you know, I can predict the effect of
some action, for example, and that's what
the sentence claims, and it gives a very high
probability to that event. And in a way, it's
kind of outstanding, kind of extraordinary,
that we are able to make such predictions
about the future using so few pieces of
information, so few variables. This attention
mechanism can thus correspond to an assumption
about how we organize our knowledge about
the world, so it's about knowledge representation,
and it's about language. The kinds of concepts
that we manipulate with language would correspond
to the kinds of concepts we have at the highest
level of representation in our mind.
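As an aside, the selection mechanism described here, scoring many candidate variables but letting only a handful through to a small "conscious" summary, can be sketched in a few lines. This is only an illustration of top-k key-query attention, not the actual RIM implementation; all names and sizes are invented.

```python
import numpy as np

def topk_attention(query, keys, values, k=3):
    """Score all candidate variables against a query, but let only the
    top-k most relevant ones through to the low-dimensional summary;
    the rest of the high-dimensional state is ignored for this step."""
    scores = keys @ query / np.sqrt(query.shape[0])  # relevance of each variable
    top = np.argsort(scores)[-k:]                    # indices of the k winners
    weights = np.exp(scores[top] - scores[top].max())
    weights /= weights.sum()                         # softmax over the winners only
    summary = weights @ values[top]                  # small, focused summary vector
    return top, summary

# 64 candidate variables in a high-dimensional "unconscious" state,
# but only 3 are attended to at a time.
rng = np.random.default_rng(0)
keys = rng.normal(size=(64, 16))
values = rng.normal(size=(64, 16))
query = rng.normal(size=16)
selected, summary = topk_attention(query, keys, values, k=3)
print(selected.shape, summary.shape)  # (3,) (16,)
```

The point of the sketch is the bottleneck: whatever the dimensionality of the full state, the prediction downstream only sees k variables at a time.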
Great. So not only language but also reinforcement
learning, as you showed in the recently published
RIM (Recurrent Independent Mechanisms) paper,
where Atari games show great generalization
ability compared with a conventional RNN.
Yes, yes.
So this notion of consciousness is, I think,
particularly important for learning agents.
An agent is an entity that acts in some
environment, like us and animals, and the
machines and robots we might build. And agents
have this problem that the world is changing
around them; they don't always see the same
distribution.
Right.
And so they need to be able to adapt to those
changes, to understand those changes very
quickly. What I'm proposing is that the
mechanisms of consciousness help them do that
by organizing their knowledge into smaller
pieces that can be recombined dynamically,
as in the RIM paper, so that they can be more
robust to those changes in the world. And
indeed, we found in experiments that these
kinds of architectures generalize better to,
for example, longer sequences than what has
been seen during training.
Very nice. Therefore, we no longer need to
shuffle the data, but can make the model
generalize by attending only to the data it
should.
Yeah, we don't want to shuffle the data. When
we shuffle the data, we are destroying
information, right? There was a structure,
and after we shuffle, we lose that structure.
Right.
That structure may, you know, come from the
time at which things were collected. Maybe
initially we were in some regime of the data,
and then something changed and the data is
a bit different. When we shuffle, we lose
that information. Of course, shuffling makes
it easier to generalize, but it's cheating,
because in the real world the data is not
shuffled. What's going to happen tomorrow
is not going to be quite the same as what
happened yesterday. So instead of shuffling
your data, what we have to do is build systems
that are going to be robust to those changes.
And that's where meta learning also becomes
important.
Yeah, talking about meta learning: you had
a paper back in the 1990s about meta learning
and learning to learn. Yes. And recently it
is getting very hot again, for example for
neural architecture search. Could you share
some of your thoughts and the advancement
in "learning to learn"?
Yeah. So when I started thinking about "learning
to learn" in those days, we didn't call it
meta learning; it was just learning to learn.
I was inspired by the relationship between
learning in the individual person or animal
and evolution. You can think of evolution
as somewhat like an optimization, not exactly,
in the sense that different species get better
and better at what they're doing through
evolution. That outer loop is like, you know,
a slow timescale: there's this process that
evolves better and better solutions. But then,
within the lifetime of an individual, there
are also improvements due to learning. Right.
So it's like learning inside learning.
And what we showed in this paper is that you
can use the same tools we had, which was just
plain back-propagation, to optimize the two
things together. More recently, in the last
few years, these ideas have been applied to
optimize how the learner is going to not just
do better at the task, but generalize, so
to learn how to better generalize. And, in
fact, you can learn how to better generalize
even if the world changes. Right. So you can
learn to be robust to changes in distribution,
which is not possible if you have the normal
static framework, where you assume one
distribution and do normal training. But meta
learning, in theory at least, allows us to
do end-to-end learning of how to generalize
to changes in distribution and to be robust
to them. And that's kind of significant,
conceptually.
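The "learning inside learning" structure described here, a slow outer loop adapting something that a fast inner loop then uses, can be shown with a toy example. This is only a first-order sketch of the idea (not the 1990s paper's method, nor MAML exactly); the task family, learning rates, and step counts are all invented for illustration.

```python
import numpy as np

# Toy "learning to learn": an outer loop (slow timescale, like evolution)
# adapts an initialization so that the inner loop (fast timescale,
# ordinary gradient descent on one task) learns each new task quickly.
rng = np.random.default_rng(0)

def inner_loop(w0, x, y, lr=0.1, steps=5):
    """Plain gradient descent on one task's squared error."""
    w = w0
    for _ in range(steps):
        grad = 2 * np.mean((w * x - y) * x)  # d/dw of mean squared error
        w = w - lr * grad
    return w

w0 = 0.0
for outer_step in range(200):                # slow timescale
    true_w = rng.normal(loc=3.0, scale=0.1)  # tasks drawn around w = 3
    x = rng.normal(size=20)
    y = true_w * x
    w_adapted = inner_loop(w0, x, y)         # fast timescale
    # first-order outer update: move the init toward the adapted solution
    w0 = w0 + 0.05 * (w_adapted - w0)

print(round(w0, 1))  # the init drifts toward 3.0, the center of the task family
```

After meta-training, the inner loop starts near the right answer for any task in the family, so a few gradient steps suffice, which is the "learning from few examples" payoff mentioned below.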
Totally agree. But also, since we are adding
a "for loop" outside the learning "for loop,"
the computational complexity is pretty heavy.
That's why for many years this area was not
very popular. But now we have a lot more compute
power than in the early 90s, and we are starting
to see how things like learning from few examples
can be done with meta learning, thanks to
the extra computing capabilities of GPUs and
TPUs.
Yeah.
It has also been noticed that the carbon
footprint caused by such training is huge.
Yeah. You have a website for calculating the
CO2 emissions, the CO2 cost. What are your
thoughts about it, environmentally?
Right. So nothing is simple in life, and there
are lots of important subtleties. Machine
learning can be used to tackle climate change:
we wrote a very long paper explaining many
applications in climate science, designing
better materials, being more efficient in
the use of electricity, or taking better
advantage of renewable energy. So machine
learning can be used to help us with this
big challenge for humanity, which is climate
change. But at the same time, all this computing
power is potentially drawing more electricity
that comes from non-renewables, creating a
large carbon footprint.
depends where you're running your experiments.
If your GPU is drawing electricity in Quebec,
which is where I live, it's a hundred percent
renewable, hydro electricity, and so there's
no carbon footprint. But if you're doing it
in the US, it depends where, or in China,
for example, where there's a lot of coal,
then it's a different story, and your
experiments, if they are big experiments,
can really draw a lot of power. And what's
maybe more concerning is that researchers,
especially in industry, are building gradually
bigger and bigger models. And it's growing
very fast, with a doubling period of every
three months or so.
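The point about the grid mattering can be made concrete with a back-of-the-envelope estimate: energy drawn times the grid's carbon intensity. The intensities below are illustrative round numbers, not official figures, and the GPU count, wattage, and duration are invented.

```python
# Rough CO2 estimate for a training run: the same experiment has a very
# different footprint depending on where the electricity comes from.
# Intensities in gCO2 per kWh; illustrative round numbers only.
GRID_INTENSITY = {
    "quebec_hydro": 2,    # almost entirely hydroelectric
    "us_average": 400,
    "coal_heavy": 800,
}

def training_co2_kg(gpus, watts_per_gpu, hours, grid):
    """kWh drawn by the GPUs, converted to kg of CO2 for a given grid."""
    energy_kwh = gpus * watts_per_gpu * hours / 1000.0
    return energy_kwh * GRID_INTENSITY[grid] / 1000.0

# Example: 8 GPUs at 300 W each, running for one week.
hours = 24 * 7
for grid in GRID_INTENSITY:
    print(grid, round(training_co2_kg(8, 300, hours, grid), 1))
```

Under these made-up numbers, the identical week-long run emits under a kilogram of CO2 on a hydro grid but hundreds of kilograms on a coal-heavy one, which is the asymmetry being described.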
Faster than Moore's Law
Faster than Moore's law, exactly. So, you
know, we can't sustain that expansion;
eventually it's going to take all the electricity
to run these AI systems. That's not good.
So we need people like you to help us design
systems that are going to be more efficient
in terms of energy. So tell me, you know,
how do you think we should address this?
Yeah, thanks for the question. I think we
need algorithm and hardware co-design for
such challenging tasks. Conventionally, we
relied on Moore's Law to give us a free lunch
of performance improvements, expecting the
computer to be faster every year. As Moore's
Law is slowing down, we want to look at both
the algorithm and the hardware: how shall
we reduce the memory footprint? I think it
is the memory footprint that dominates the
energy cost. Computation is cheap; memory
is expensive.
We had several successes, like Deep Compression,
where we can reduce the model size by an order
of magnitude to save memory. The Efficient
Inference Engine saves computation by skipping
the zeros (zero multiplied by anything is
zero). Recently, we've been working on reducing
the cost of neural architecture search for
transformers, which previously took the carbon
footprint of five cars' lifetimes.
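The "zero multiplied by anything is zero" idea can be shown in miniature: after pruning, a sparse dot product only visits the nonzero weights. This is just an illustration of the principle in plain Python, not the EIE hardware design; the numbers are invented.

```python
import numpy as np

def dense_dot(w, x):
    """Touches every entry, including the pruned zeros."""
    return sum(wi * xi for wi, xi in zip(w, x))

def sparse_dot(nonzero, x):
    """'nonzero' stores only (index, value) pairs of the surviving
    weights, so the zero multiplications are simply never performed."""
    return sum(v * x[i] for i, v in nonzero)

# A pruned weight vector: 5 of 8 entries are exactly zero.
w = np.array([0.0, 0.0, 1.5, 0.0, -2.0, 0.0, 0.0, 0.5])
x = np.arange(8, dtype=float)
nonzero = [(i, v) for i, v in enumerate(w) if v != 0.0]

print(dense_dot(w, x))         # 8 multiply-adds -> -1.5
print(sparse_dot(nonzero, x))  # only 3 multiply-adds, same answer -> -1.5
```

With the sparsity levels Deep Compression reports (an order of magnitude), the saved multiplications, and more importantly the saved memory traffic for the zero weights, add up.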
That's another subtle thing: those numbers
reported in the press about huge footprints
are mostly due to hyperparameter optimization,
searching the space of architectures and
hyperparameters. And that is like 1,000 times
more expensive than training a single network.
So if you're in academia like me, and you
don't have access to large computing power,
you rely on human brains to do the search,
and it is much more efficient. You don't have
as much computing power, but the students
who are doing the experiments have done many
experiments in the past, and they kind of
know how to explore, so they find good solutions.
Whereas the methods we currently use for
exploring the space of architectures are more
like brute force. And so that's super expensive.
Yeah, totally agree. When I joined MIT last
year, we had only eight GPU cards; there was
no way my students could do neural architecture
search. So we had to combine human intelligence
with machine intelligence (to prune the space)
for architecture search. As a result, we can
do the search in a more cost-efficient way.
That's great.
Thank you. Alright, so lastly, as you have
been a pioneer of AI for many decades, what
is your advice for the younger generation,
and on future directions?
So, one thing I find sad in the current culture
of students and researchers in machine learning
and AI is that they're very stressed, very
anxious. There's a lot of competition. But
the best science is not done in those conditions.
The best science is done when you're thinking
long term, when you have time to really ponder
and brainstorm and try things and let the
ideas evolve. Instead, we are currently in
a sort of rush of preparing something for
the next deadline, and the next deadline;
every two or three months, we have another
deadline. I don't think that's good for the
field. And it's not even good psychologically,
because you're always stressed. So my suggestion
is to step back, to think about more ambitious
goals and hard questions, rather than what
you can do in the next few weeks or for the
next deadline. And listen to your intuitions.
And also, be open with your ideas: share your
ideas, talk about them, even if they're not
published yet. Don't be afraid of having other
people steal your idea. It's much more profitable
to engage positively with others, for
psychological reasons but also in terms of
scientific productivity, than to try to keep
things to ourselves and be secretive. It just
doesn't work.
Totally agree. Yeah. All right, thanks so
much for sharing both the technical side and
the advice for the younger generation. Really
appreciate your thoughts.
Thank you for the questions.
Thank you so much.
