Welcome everyone,
thanks for waking up so
early after this
tiring conference.
So today, we have an excellent
lineup of speakers for
the workshop.
The workshop focuses
mostly on coming up with very
small models that can fit
onto your tiny devices, or
that work when your prediction
has to be on a budget.
So the first speaker that
we have is Forrest Iandola.
Forrest is working at DeepScale,
where they are coming up with
deep learning models for
embedded devices.
And after that, we'll have Shiv
from Amazon who's
working on Alexa.
This is an area which is
up-and-coming, and we have
lots of exciting
problems in it.
It's not that we already have
a well-defined set of problems.
We are all still working out
the definition of the problems,
the correct questions
to ask here, and so on.
So please let's keep
it very informal,
more discussion-oriented.
Each of our invited speakers
has a slot of about 35 minutes.
We will ask our speakers to
make the talk itself about 25
or 26 minutes.
And the remaining time we want
to spend on questions and
answers and discussion.
So let's keep it that way.
And in the evening,
we have a panel.
The goal of the panel
is to discuss what
are the current directions that
people are thinking about.
What are the important problems?
How can we take these things
to pretty large scale,
and how can we build very
impactful systems and algorithms?
So without further ado, let me
invite Forrest and get started.
>> So my name's Forrest Iandola.
And I'm gonna be talking about
small deep neural networks,
their advantages and
their design.
This talk is less
of a research talk.
It's a little bit more like
a tutorial format, I would say.
So we're gonna be not so
much beating the drum for
new research that we've done.
It's more I'd like to share some
of the learnings we've had over
the past couple of years in
developing small neural nets.
So I'll first start
with computer vision, which,
after 47 years of research,
finally works pretty well.
So now how do we productize
that and deploy it?
We'll find that making neural
nets smaller is one of the key
factors in
how we make that happen.
We'll talk about two advantages
of small neural nets,
why it's useful for
neural nets to be smaller.
Especially in embedded and
in-the-wild, gadget-type
applications.
And then, finally, I'll go
through some design principles
that we and others in the
community have found are quite
useful for developing
small neural networks.
So computer vision,
after a long history,
finally works quite well.
So what I'm showing here is
a plot that many of you have
probably seen before,
which is the ImageNet error and
how it's improved over
the last five or six years.
From 2010 to 2016, it's gone
from about 30% to about 3% for
the best models that have been
reported in the literature.
So, an order of
magnitude improvement.
And we've seen similar types
of improvements elsewhere
in computer vision.
For example, in semantic
segmentation, object detection,
3D reconstruction, and so on.
And outside of vision as well,
in audio and text and
other areas,
we've seen huge improvements due
in part to new machine learning
algorithms, and particularly
deep neural networks.
But also as a result
of better and
better neural networks and
new innovations in neural nets.
It's not just a question of
neural net or not; it's
consistent, quick progress on
improving neural nets and their accuracy.
There are plenty of
seats up front, too.
So computer vision
finally works now.
What are the venues for
deploying this to create
value for real products?
So, from my perspective,
I see two big buckets of
where vision is useful.
So there's first the data
center applications,
of which there are tons.
So a couple of examples would
be social media analysis in
the cloud.
Web indexing,
government intelligence.
Things that, while
being important, don't
necessarily require
real-time results.
A lot of this can be
done batch and offline.
The other area, I would say,
is what we might
generally call gadgets.
So stuff that's not
purely in the cloud.
So self-driving cars,
smartphones, drones and
other real hardware that's
outside the data center.
And in the sort of server space,
I think a ton has
already happened.
There are many products from
Google and Microsoft and Amazon,
and from others with an
especially big server presence,
that are already creating value
for lots of people.
I think in gadgets, we're just
getting warmed up right now.
So, what are the requirements
for successfully using vision
in these applications?
So in data center land,
I would say it's not
usually safety-critical.
You're not usually manipulating
physical hardware that
could hurt somebody.
Low-power is definitely useful,
but it's not necessarily a key
enabling technology to
get off the ground.
And then, real-time also tends
to be useful, but again,
there's a lot you can do
in batch processing and
data center as well with
vision and learning.
So, gadgets are a little
bit different.
So in gadget land, with the
exception of smartphones,
it's usually safety-critical.
So we're talking about
things like vehicles weighing
multiple tons
driving around on the road,
drones that could crash
into all kinds of things.
So not crashing into stuff
is super important, and so
it's safety-critical.
Low-power is really important,
and the reasons for
that vary on the application.
So on smartphones and drones,
it's often especially
an issue of battery life.
In things like self-driving
cars, it's often more that
smaller processors that can
work under passive cooling
tend to be more reliable and
last for the lifetime of a car.
And then,
real-time in most gadget
applications is super important.
So in terms of smartphones,
no one likes to be kept waiting,
and in terms of [INAUDIBLE] and
drones and
cars, it's all gotta
respond real-time
cuz it's interacting
with the environment.
So, a lot of what we're gonna
talk about in this talk
I think will be especially
useful for gadgets, but
hopefully will add some value
on the data center side as well.
So to kind of restate that,
so what's the right neural
network for use in gadgets?
So, I think some of the key
desirable properties
are sufficiently high accuracy,
which is defined with respect to
exactly what problem
you're trying to solve.
Low computational complexity,
ideally such that you can
run it in real-time on
a fairly modestly sized
embedded processor.
Low energy usage, for the
reasons I mentioned before, for
battery life, for lower cost,
smaller chips, and longevity.
And then finally,
one I haven't mentioned yet and
I think the one that's just
starting to get warmed up,
is small model size.
So, let's talk about
why small model size
in particular is
quite important.
So why small deep-neural-nets?
There are lots of reasons.
I'll discuss three of them at
some level of detail here.
So first of all, small
deep-neural-nets, we've found in
our research and others
have found the same thing,
that they train faster
on large scale hardware.
So, the amount of
gradient data you have
to transmit when doing
data-parallel distributed training
is proportional to how many
parameters you have in
your neural net.
And so the fewer parameters you
have, i.e., the smaller the model,
the more hardware
you can throw at it,
assuming the same amount
of computation but
fewer parameters, and the faster
you can finish the training.
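To make that concrete, here's a rough back-of-the-envelope sketch in Python; the parameter counts below are approximate figures for AlexNet-scale and SqueezeNet-scale models, not numbers quoted in this talk.

```python
# Sketch: gradient traffic per data-parallel step scales with parameter count.
def grad_bytes_per_step(num_params, bytes_per_param=4):
    """fp32 gradient bytes each worker contributes to one all-reduce step."""
    return num_params * bytes_per_param

alexnet_like = 60_000_000    # ~60M parameters (AlexNet-scale, rough)
squeezenet_like = 1_250_000  # ~1.25M parameters (SqueezeNet-scale, rough)

print(grad_bytes_per_step(alexnet_like) / 1e6)     # ~240 MB of gradients per step
print(grad_bytes_per_step(squeezenet_like) / 1e6)  # ~5 MB of gradients per step
```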
Also, on embedded
processors, computation is at
a premium, but memory is at
a huge premium;
small embedded processors
often have slower memory and
less of it.
And so, the more of the model,
ideally all of the model,
you can fit on the chip itself
without interacting with
the memory, the more successful
you'll often be in deploying
it on an embedded tiny
little processor.
And then finally, in most
applications where we care about
all these different things,
efficiency,
accuracy and so forth,
there are tons of innovations
that will happen on the neural
net side in developing
better nets.
There will also be lots of
information we gather from
the devices themselves,
potentially.
And all of this culminates
in wanting to keep improving
the devices after they're deployed.
We'd like to update them
over the air, in other words,
ship new model
parameters to them.
And so the smaller the model is,
the less communication
you have to do for that.
So, I'll go through each
of these in just a touch
more detail.
So sorry, slide format
changed a little bit.
I borrowed these from Kirk.
These are slides we've had for
some time, but
they should illustrate
the point well.
So this work is an extension
of our CVPR 2016
paper on FireCaffe.
So, what we reported in
that paper was that on 128 GPUs,
we could speed up
training by 47x by
using all-reduce data-parallel
communication.
And somewhat disappointingly,
when we tried 256 GPUs on
the Titan supercomputer,
we only got about a 60x speedup
over single-machine,
single-GPU training
with GoogLeNet.
And the reason was most of
the time was going in and
communicating between machines.
So we designed this
net called FireNet,
which is similar to SqueezeNet
in architecture with a few
different hyperparameters
than SqueezeNet.
And it's substantially
smaller than GoogLeNet.
So 13 megs instead of 53 megs.
And it has similar accuracy.
And on 256 GPUs,
we got it to be 145 times faster
than single-machine training.
And I think it took around,
if I recall,
three hours to train to
GoogLeNet-level accuracy.
This was in 2015 with two or
three generation old GPUs
compared to what we have now.
So the smaller model size with
similar compute footprint led to
faster, more
distributed training.
So another advantage of smaller
models is their amenability to
going on embedded hardware.
So as we mentioned before, the
energy efficiency in embedded
hardware land is
really critical.
And if you compare the amount of
energy used to do a computation,
so an add or multiply on
a floating point number, to
the amount of energy needed to
move that number to and from
off-chip DRAM, through the
off-chip memory hierarchy,
it's like a 100x difference.
It's 100x more expensive
to move the data on and
off-chip than it is to
manipulate the data on-chip.
And so the less we can be
dependent on off-chip memory,
the more efficiently we can do
things in embedded hardware.
And that's one reason
why smaller models
have the advantage on
embedded hardware.
And then, finally,
over-the-air updates, or
what we may call continuously
updating models.
So anytime a gadget in the wild
discovers an anomaly, it may
require retraining, or
sending that sample
to a human to take a look
and do some quality assurance,
then adding it to
the training set,
continuing the model retraining,
and then shipping a new model.
The smaller the model is,
the cheaper this is to do.
And in some cases, like if you
look at how Tesla does their
over-the-air updates in their
cars, you connect to WiFi first.
And so shipping a 250-megabyte
model over WiFi,
while not free, is reasonable.
But if you look at
the mainstream automakers like
General Motors,
they don't assume that you're
taking your car into
your garage and
plugging it into the wall every
night and being near WiFi.
So a lot of people
park on the street.
And so GM has architected their
client-car, or I should call it
car-cloud, interaction
system over,
I believe,
4G in their OnStar system.
And so that OnStar system,
even without doing
deep neural net stuff,
is doing normal telematics
communications with the cloud.
Based on talking to some people
who work in the telematics
space, about 60 to 70% of GM's
cost of running the OnStar
program goes directly to
telecom companies to pay for
that 4G data.
So imagine how that's gonna blow
up once we start transmitting
new neural nets to cars.
So the smaller the model is, the
cheaper this whole over-the-air
update business will be.
So that's the motivation
on why small
neural networks
are really important.
Now I'm gonna talk about
a couple of key advances in
the small neural net space over
the last year and a half or so.
And especially in
the past few months,
there's been an explosion
of research in this area.
So I can't possibly
cover it all.
But I'll talk about SqueezeNet
and about MobileNets,
which are two interesting
results in this space.
So SqueezeNet is a model
that a few colleagues and
I invented, or at least
published, about a year and
a half ago now.
And it's comprised primarily
of these Fire modules, as we
call them,
which are like a simplified
Inception module.
So they have one filter
bank of 1x1 filters,
followed by a mix of 1x1 and
3x3 filters.
The first part is what we called
the squeeze conv layer.
Some people call
it a bottleneck now.
It typically has fewer filters
than the previous layer.
And then there's the expand layer,
with a mix of 1x1s and 3x3s.
We designed it that way so we
could play with how many 1x1s and
how many 3x3s are in a layer and
see what the tradeoffs would be.
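As an illustration, here's a minimal PyTorch-style sketch of a Fire-like module following the description above; the example channel counts at the bottom are just illustrative.

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """Minimal sketch of a SqueezeNet-style Fire module."""
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        # Squeeze layer: 1x1 "bottleneck" with fewer filters than the input.
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        # Expand layer: a mix of 1x1 and 3x3 filters, concatenated on channels.
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)

# Illustrative usage, e.g. an early-layer module:
fire = Fire(96, squeeze_ch=16, expand1x1_ch=64, expand3x3_ch=64)
```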
So I'll get into more of the
nuts and bolts of convolutions
and their design dimensions
later in the talk.
But here's the main result from
the SqueezeNet work. Up to
that point, for people who
wanted to make a smaller model,
one of the key
things they did, and
still do to good effect, is what
they call model compression.
So here are a few model
compression techniques applied
to AlexNet.
So that takes us from 240
megabytes to just about
7 megabytes using a technique
called deep compression,
which deletes some of
the parameters and
then quantizes
the remaining ones.
So that's a 35x smaller model
with the same accuracy.
So that's quite impressive.
But then we looked at how
SqueezeNet compared:
SqueezeNet with no model
compression at all is 50 times
smaller than AlexNet at
the same accuracy level.
So that was quite exciting, and
we actually teamed up
with Song Han and
some of his colleagues
at Stanford and NVIDIA,
who did the deep
compression work, and
applied deep compression to
SqueezeNet. We found that we
could get another factor of 10
shaved off, so that's about a 510
times smaller model than AlexNet
with the same accuracy level.
So you may ask why AlexNet?
AlexNet was a model that
was particularly popular
in the model compression
community so
we thought we would just
compare directly to that.
This was one of a few
papers that really got
people revved up about
making smaller models and
there's been an explosion
of work since then.
One of the more interesting
ones is called MobileNets, from
Andrew Howard and some of
the folks in Google Research.
One of the key ideas they've
popularized is what we might
call depthwise separable
convolution filters.
So the idea is to have
convolution filters that only
apply to a subset of
the channels in the neural net;
even AlexNet did some of that,
with, I think, groups set to two
in its group convolutions.
But what's cool about this is
they actually just have
3x3-by-one-channel filters,
and a unique filter for
each channel.
And so this saves hugely on the
number of filters in the three
by three convolution layers.
And I'll get into more of this
later. As for the key results:
number one, I really like
that this paper doesn't
just report one model, but
a variety of versions
of the model that have different
trade-offs in terms of accuracy,
model size, and speed.
And I think that's really
important: to share the intuition
on the range of things
that we have found, not just
the single model that
is most impressive.
So, in the context of SqueezeNet:
SqueezeNet had
about 57-point-something percent
top-1 accuracy on ImageNet.
There's a version of MobileNets
here, the one with the green arrow
next to it, that is about
5 percentage points,
a little more than 5 points,
more accurate than that,
while using about the same
quantity of parameters.
So in absolute terms,
this has advanced the state of
the art over SqueezeNet and
plus it's provided a more rich
range of models with different
trade-offs in this space.
So those are a couple
of compelling results,
MobileNets and SqueezeNet, and
of course there are many
other exciting ones as well.
Let me now move on to: okay, so,
small models are important,
[INAUDIBLE] small models,
so how do you create your
own small neural net?
What are the ingredients?
How do you go about it?
So before I go into that,
let me just briefly talk about
what I call the anatomy
of a convolution layer.
So, a convolution
layer basically has,
the black thing there
is the input data.
So if this is the first
layer in the neural net,
that would be the image for
example, with three channels.
If it's a later layer,
it would be the activations
from the previous layer.
You know, the temporary variables
that were previously generated.
And so it has a height and a
width, and a number of channels.
Number of channels is defined
by how many filters you
had in a previous layer,
typically.
And then there's a filter.
So that green thing
is the filter.
It's got a height and a width,
and a number of channels.
And then typically,
there are multiple filters.
Which, the number of filters is
something you as an experimenter
would set or define.
There's often a batch size which
allows you to concurrently
compute more than one
image if you'd like to.
So a lot of you probably
already know this.
The reason I'm going into it is
cuz we're gonna be manipulating
some of these dimensions in
the following discussion.
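Since these are the dimensions we'll be manipulating, here's a quick sketch of the parameter arithmetic; the helper function below is just for illustration.

```python
def conv_params(kh, kw, in_channels, num_filters):
    """Weight count of a conv layer: each of the num_filters filters
    has shape kh x kw x in_channels (biases ignored for simplicity)."""
    return kh * kw * in_channels * num_filters

# Example with hypothetical dimensions:
print(conv_params(3, 3, 256, 256))  # 589,824 weights for a 3x3 layer
print(conv_params(1, 1, 256, 256))  # 65,536 weights for the 1x1 version
```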
So I'm gonna go through now
seven different techniques that
you can apply individually or
altogether, if you'd like, for
coming up with smaller models.
So the first one is
replacing fully-connected
layers with convolutions.
So in models like AlexNet and
VGG, most of the parameters
are in fully-connected layers.
So for example, the FC7 layer of
AlexNet has 4096 input channels
and 4096 filters, so that's 67
megabytes of parameters which is
about 13 SqueezeNet's worth
of parameters in one layer.
So I'm not saying
that fully-connected
layers are bad by any means;
rather, what I'm saying is that
in these particular models,
the designers decided to have
lots of filters and lots of
channels in those FC layers, and
it just turns out that they have
a lot of parameters as a result.
What we found in our
experiments is often you can
substitute those FC layers
with more convolutions:
have fewer filters and
fewer channels in
those convolutions,
but a larger
activation grid.
FC layers are basically
convolutions over a one-by-one
activation grid, so you can make
that grid bigger and often preserve
accuracy with about the same
compute footprint and
a lower number of parameters, which
gives you all those benefits
we talked about before.
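Here's a hedged sketch of what that substitution can look like in PyTorch. The 512-channel, 8x8-grid numbers are hypothetical, chosen so the compute footprint roughly matches the FC layer; they are not from the talk.

```python
import torch.nn as nn

# An FC layer is a convolution over a 1x1 activation grid:
fc7_like = nn.Linear(4096, 4096)                    # 4096*4096 ~ 16.7M weights (~67 MB fp32)
fc7_as_conv = nn.Conv2d(4096, 4096, kernel_size=1)  # same weight count, conv form

# One hypothetical substitution: fewer filters/channels, larger grid.
# A 1x1 conv with 512 channels over an 8x8 grid does roughly the same
# number of multiply-adds (512*512*8*8 ~ 16.8M MACs) with ~64x fewer weights.
slim_conv = nn.Conv2d(512, 512, kernel_size=1)      # 512*512 ~ 262K weights
```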
Another technique, one
that we rely on heavily in
SqueezeNet, is what
we call kernel reduction:
reducing the height and
width of some of the filters
in a neural net model.
In particular, a flavor of
this that I want to highlight is
replacing 3x3 filters
with 1x1 filters.
And some people get confused.
What's a 1x1 convolution filter,
wouldn't that be a scalar
number or something?
Well, what we're actually
looking at is 1x1 by the number
of channels.
So it's like a vector
that gets multiplied by
all the channels at each x-y
location in the activations.
And so
these 1x1 filters can't really
see outside of a one-pixel radius,
except through the receptive field
from previous layers, but they
retain the ability to combine
and reorganize information
across channels, which turns
out to be quite powerful.
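A small sketch to make that concrete: a 1x1 convolution is just a per-location matrix multiply across channels. The shapes below are arbitrary illustrative values.

```python
import torch

# At every (x, y) location, a 1x1 conv multiplies the C_in-dim channel
# vector by a C_out x C_in matrix. Demonstration with random data:
N, C_in, C_out, H, W = 1, 64, 32, 28, 28
acts = torch.randn(N, C_in, H, W)
conv1x1 = torch.nn.Conv2d(C_in, C_out, kernel_size=1, bias=False)

out = conv1x1(acts)
# The same computation as a matmul over the channel dimension:
w = conv1x1.weight.view(C_out, C_in)           # drop the 1x1 spatial dims
out_mm = torch.einsum('oc,nchw->nohw', w, acts)
print(torch.allclose(out, out_mm, atol=1e-5))  # True
```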
And so when we were hunting for
the SqueezeNet model,
doing design-space exploration,
exploring a variety of models,
defining them and
training them, we found that for
one particular SqueezeNet-like
model,
we started out with certain
layers having all 3x3 filters.
We replaced half the 3x3
filters with 1x1s.
We got the same accuracy but
saved a significant
number of parameters.
And of course a 3x3 has
9x the parameters of a 1x1,
so that's pretty important.
And then I just wanna
highlight: when looking around,
searching the design
space, and trying different
models, there's often a point
where adding more parameters in
some dimension doesn't actually
lead to any more accuracy.
And so
I call that a saturation point.
The neural net design space
is infinitely large, but
you can search for the point
where, if I add more
parameters in this particular
dimension of the net,
it doesn't help accuracy.
It just wastes more
compute and model size.
Knowing those points is really
powerful because you can get to
something that's more
optimal for your application.
The third technique is
channel reduction:
reducing the number of
filters and channels in the net.
This is really powerful because
if you used to have 256
channels and you go down to 128,
or maybe I could say it
better as: if you start out with
256 filters, you cut that to 128,
and you do that for all your
layers, for example, you won't
just get a 2x reduction,
you'll actually get a 4x reduction.
Because the parameter count
multiplies how many filters you
have by how many channels you have.
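The arithmetic, as a tiny sketch:

```python
# Halving both filters and channels in a 3x3 layer is a 4x, not 2x, saving:
before = 3 * 3 * 256 * 256   # 589,824 weights
after  = 3 * 3 * 128 * 128   # 147,456 weights
print(before / after)        # 4.0
```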
So this is very,
very powerful.
And this is another one where I
don't have a good formula for
exactly how many filters and
channels you should have, but
you should hunt around and
look for the saturation point.
And then see what you can
tolerate in terms of accuracy
versus efficiency.
But a lot of the models that
I think people are using,
that are widely cited, you can
cut down the number of channels
and filters significantly
without diminishing accuracy.
In this fourth one, evenly
spaced downsampling, this one is
a little bit of a break from
making smaller models, and
it's more about given some
number of parameters,
how do we most effectively use
them to find the right trade
off of computational
efficiency and accuracy?
So a design choice when
designing particularly
a convolutional neural net is
typically the input data and
the output data
are different sizes, right?
So the input might
be an image and
the output might be a vector or
the input might be an image and
the output might be a semantic
segmentation map that is
of slightly different dimensions
or smaller dimensions.
So typically, we have to
upsample and/or downsample
within a neural net. In
classification nets we usually
start with an x-y grid
the size of the image and
downsample all the way
down to potentially 1x1.
So where in the net do you
want to put that downsampling?
Some call the downsampling
max pooling;
there are other
techniques as well;
we'll just call it pooling for
simplicity.
So you can put it all
the way at the beginning,
squashing the height and
width of the activations down
a lot early on, which
is kinda bad for accuracy but
doesn't use much computation.
Or we could push the
downsampling towards the end,
which is good for accuracy but
bad for speed. And by the way,
I'm assuming here that
we have some fixed number of
filters and filter sizes per
layer, and we're just deciding
where to downsample with
respect to that.
And so
in our experiments, the ones
that led up to SqueezeNet and
ones we've done since then,
something that we keep finding
over and over again
is that if you downsample
as gradually as possible
from beginning to end,
that tends to be great for
accuracy and a reasonable
trade-off in terms of speed.
And also, stacking a bunch of
downsampling layers too close
together, we've found,
can make the model diverge or
can make it hard to train.
So don't shock the model all at
once; just gradually downsample,
and things will probably work
pretty well.
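As a sketch of the idea, here's a hypothetical 10-layer backbone; the layer count, channel width, and downsampling positions are illustrative, not SqueezeNet's actual schedule.

```python
import torch.nn as nn

# Hypothetical backbone for a 224x224 input, downsampling to 7x7.
# Instead of stacking all the stride-2 layers at the start or the end,
# space the five 2x downsamplings roughly evenly through the stack:
downsample_at = {0, 2, 4, 6, 8}  # layer indices that halve height/width

layers, ch = [], 3
for i in range(10):
    stride = 2 if i in downsample_at else 1
    layers += [nn.Conv2d(ch, 64, kernel_size=3, stride=stride, padding=1),
               nn.ReLU(inplace=True)]
    ch = 64
backbone = nn.Sequential(*layers)  # 224 -> 112 -> 56 -> 28 -> 14 -> 7, gradually
```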
So back to making small models.
So this is the fifth technique,
depthwise separable
convolutions.
It goes by different names;
some people call this group
convolutions, or cardinality.
And each of these terms
sometimes means slightly
different things, but the basic
idea is: rather than having each
convolution filter apply to
all the channels, we have each
convolution filter apply
to just some of the channels.
So this idea's been around for
a while.
Even AlexNet did this, in a fairly
gentle and conservative way.
What's interesting,
particularly with papers like
MobileNets and ResNeXt, is
the discovery that you can get
pretty high accuracy
with neural nets
even if each of the three-by-
three filters in your model
only applies to one channel.
So you have a separate,
unique filter for
each channel in your
neural net layer.
And the big surprise that I
don't think any of us expected
is that it actually doesn't kill
the accuracy when you do that;
it's usually a fairly modest
tradeoff, and often worth it.
And you could add more layers or
more filters and build up to
your previous accuracy
level while doing this and
still be coming out
way ahead on how many
parameters you have
in your model.
So that's quite exciting.
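Here's a minimal PyTorch sketch of the substitution; the channel count is illustrative.

```python
import torch.nn as nn

C = 256  # channels in this hypothetical layer

# Standard 3x3 conv: every filter sees all C input channels.
standard = nn.Conv2d(C, C, kernel_size=3, padding=1)  # 3*3*256*256 ~ 590K weights

# Depthwise separable replacement (MobileNets-style):
depthwise = nn.Conv2d(C, C, kernel_size=3, padding=1, groups=C)  # one 3x3x1 filter per channel: ~2.3K weights
pointwise = nn.Conv2d(C, C, kernel_size=1)                       # 1x1s recombine the channels: ~65K weights
separable = nn.Sequential(depthwise, pointwise)
```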
So, before I actually speak to
this slide, let me just say:
if we apply these
five optimizations to our model
so far,
where we typically end up is
with most of our parameters
being in 1x1 convolutions.
Our model is a few separable
3x3 convolutions plus
a boatload, well,
a small boatload, of 1x1
convolutions; that's what's remaining.
So, how do we cut down the
number of parameters in those 1
x 1 convolutions without
hurting accuracy?
So one idea would be to
take those 1 x 1s and
again, have them only apply to
a few of the channels instead of
all the channels.
But the problem is, if you have
your 3x3s and your 1x1s,
let's say like in MobileNets
where you're alternating between
3x3s and 1x1s,
and none of those filters
apply to all the channels,
what you end up with is a bunch
of different neural nets that
never talk to each other.
They go from beginning to
end without exchanging any
information, and
you lose a lot of
representational power that way.
So, an interesting approach,
which I first learned
about in this paper that
just came out last month,
called ShuffleNet, is what you
might call a shuffle layer.
This is a kind of shockingly
straightforward thing, and
I sort of can't believe I didn't
think of it myself;
it makes a lot of sense.
So you could
potentially have all the layers
in the neural net only apply
to a subset of the channels.
So your 1x1 could be 1x1
by a third of the channels, or
a tenth of the channels, or
something.
And then you do a shuffle,
which I might call a transpose,
so that those channels
do get mixed around and
the next set of
depthwise separable
convolutions has exposure
to more of the channels.
So, I actually haven't gone
through this paper in enough
detail to really summarize how
well it works and
what the tradeoffs are,
but I think
this general approach will be
very interesting in the future.
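For reference, a shuffle layer can be written as a reshape plus transpose, following the description above; this is a minimal sketch assuming a ShuffleNet-style formulation.

```python
import torch

def channel_shuffle(x, groups):
    """ShuffleNet-style shuffle: interleave channels across groups so the
    next set of grouped convolutions sees a mix of all the groups."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)  # split channels into groups
    x = x.transpose(1, 2).contiguous()        # the "shuffle" is a transpose
    return x.view(n, c, h, w)                 # flatten back to n x c x h x w

x = torch.randn(1, 12, 8, 8)
y = channel_shuffle(x, groups=3)  # same shape, channels interleaved
```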
And then finally,
the seventh approach and
the last one, compression and
distillation.
So this is a whole
research area in and
of itself that I'll
just briefly touch on.
So model distillation is
one approach where, instead
of training directly
on the training data,
you train the small model to
replicate the sort of probability
distribution that a high-accuracy,
larger model would produce.
And Geoff Hinton and that team
made a lot of noise about this
with a paper called
Dark Knowledge, which was really
interesting, but it focused more
on some of the smaller datasets.
And it was kind of hard to
figure out how it worked
in larger scale
practical applications.
But this paper on mimicking,
for efficient neural nets for
object detection, really
sold me on distillation.
It shows significant
improvements in
the ability of
a small model to achieve
high accuracy through
this distillation approach.
That's quite exciting.
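As a sketch of the general recipe, following the Hinton-style softened-distribution idea; the temperature and weighting values below are illustrative, not from the papers mentioned.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Match the teacher's softened distribution, plus a standard
    cross-entropy term on the true labels (T and alpha are illustrative)."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction='batchmean') * (T * T)  # T^2 rescales gradients
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```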
And then finally,
there's the compression side,
where there's a lot
of research going on.
Deep compression is one
of the approaches, and
it unifies three different
techniques into one system.
So it deletes a good chunk of
the parameters in the model.
It quantizes
the remaining parameters.
I think in
the deep compression paper,
they did the quantization
pretty aggressively and
used a codebook, which is kind
of hard to compute efficiently.
But there are more recent papers,
like the one called [INAUDIBLE],
which showed that you can get
at least down to eight bits,
just do linear
quantization,
preserve most of the accuracy,
and get a lot more efficiency.
And then finally, Huffman
encoding after doing the other two.
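Here's a toy sketch of the prune-then-quantize pipeline described above, using magnitude pruning and the simpler linear quantization; the real deep compression paper uses codebook quantization, Huffman coding is omitted, and all the constants here are illustrative.

```python
import numpy as np

def prune_and_quantize(weights, keep_frac=0.3, bits=8):
    """Toy deep-compression-style pipeline: magnitude-prune the weights,
    then linearly quantize the survivors to `bits`-bit integers."""
    # 1) Prune: keep only the largest-magnitude fraction of weights.
    threshold = np.quantile(np.abs(weights), 1.0 - keep_frac)
    mask = np.abs(weights) >= threshold
    pruned = weights * mask
    # 2) Linear quantization of the remaining values.
    scale = np.abs(pruned).max() / (2 ** (bits - 1) - 1)
    q = np.round(pruned / scale).astype(np.int8)
    return q, scale, mask  # dequantize later with q * scale

w = np.random.randn(4096).astype(np.float32)
q, scale, mask = prune_and_quantize(w)
```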
So, in extreme cases,
you can find as much as a 30 or
40x reduction in model size
through deep compression and
other contemporary techniques.
And even for really
small models like SqueezeNet,
we've found
you can do the compression and
get at least a 10x
compression ratio compared to
densely represented floating-
point numbers for the weights.
So, deep neural nets have been
applied to a variety of vision
tasks outside of classification,
while my examples so far are more
classification-oriented.
So I'll just go through these
quickly, as we're low on time.
So, SqueezeNet has been used for
style transfer
quite effectively.
The Not Hotdog application, for
those who watch Silicon Valley,
they actually
released a real app:
one version of it
used SqueezeNet and
the other one used
a version of MobileNets.
For Chinese character recognition,
it gets size savings while
defining the state of the art
on accuracy.
Semantic segmentation, this
field is moving super fast and
it's hard to keep up,
but this ENet model has similar
accuracy to the smaller version
of a model called SegNet,
which is quite popular.
And it's 167 times smaller,
which is quite exciting.
And the big technique they
use in this, I think,
is really channel reduction.
The model just doesn't
have many channels.
It turns out you can get pretty
respectable accuracy on semantic
segmentation with few channels.
So the last thing, I think
there's a brave new future
that will be enabled
by small neural nets,
and there's a set of open
problems to go along with that.
So, how small can we get?
You know, I think there's
a lot of experiments going on,
including the ones
I've discussed here.
I'd be excited to see a more
theoretical underpinning as well
towards what do you need, in
terms of representational power,
to solve particular problems.
Algorithmic Information Theory
may be one avenue to base that
theory on.
Also, can we fit deep neural
nets with state-of-the-art
accuracy,
so again, 3% top-5 error
on ImageNet, for example,
in the cache of a 2-watt
embedded processor, or less?
We can kind of put neural nets
everywhere if we can do that.
And what can we get out of,
say, a 10 kilobyte neural net?
And finally,
I've mostly focused on vision,
cuz that's what I know best.
But I'm interested to see
which of these techniques
are particularly effective in
audio and text and elsewhere,
and what new techniques
might be needed there.
And then this is pretty
much the last slide.
So, let me make an argument that
however important we think small
neural networks are today,
that value is going to
go up significantly.
So to my knowledge, there are at
least 15 different startups
working on brand new computer
architectures targeted at
specific types of neural
net computations.
And, their objective is anywhere
from a 10 to 100x improvement in
computation-per-Watt efficiency
on these applications.
And I think going from general
purpose to special purpose
processors for neural networks,
there's a huge opportunity there
to make things more efficient,
so it's very exciting.
Meanwhile, I don't see a lot of
low-hanging fruit in memory.
Memory is still improving,
but a 2x
improvement on the typical metrics
of memory would be a big deal.
And so, to put it all together,
unless we're able to
fit the neural net parameters
and activations on the chip
itself without a lot of
off-chip communication, people
will build these amazing neural
net accelerators and find all
their power going to memory,
which would be unfortunate.
So, I think the importance of
small neural nets only stands to
increase.
So, with that, I managed to get
through an entire talk without
talking about our
startup DeepScale,
which is focused on building
small and efficient models,
particularly for
the automotive industry.
And we're working with a number
of big auto makers that you've
probably heard of on deploying
these in real vehicles.
And, this is all very exciting.
And we're always open to
bringing new people in to work
on this mission with us.
So, are we there?
I think we have about three or
four minutes for questions.
>> [APPLAUSE]
>> [INAUDIBLE]
>> I'll
just repeat the question,
that's fine.
>> When you say keeping them
on the chip as opposed to RAM,
what does that exactly mean,
like does that mean it's hard
coded and cannot be changed?
>> I see, so the question is,
what does it mean,
on versus off-chip and does
it have to be hard coded in the
computer architecture itself.
So the way computer
architectures these days tend to
work is sort of
a memory hierarchy.
So you've got your DRAM,
L3, L2, L1 cache, and
then register file.
So often, at least a couple
layers of that cache
plus the registers
will be on-chip.
And those hold modifiable
information, so
it's sort of like: how can we
keep everything in cache?
And that may be a traditional,
automatically managed cache
based on heuristics, like
least-recently-used or similar.
Or, it could be the developer
actually controls some of
how things are stored
in the cache.
But yeah, it's typically SRAM or
other type of technology for
storing temporary
variables on the chip.
>> Thank you for
your presentation.
>> I see.
So your question is: in this talk
we mostly talked about inventing
new neural net architectures,
but how does that
fit with reducing the bitwidths,
or quantizing the activations or
model parameters?
So this deep compression
work does that;
there are of course many
techniques for that.
I think the two are better
together, right?
So we want to find the right
model and compression technique,
and we can also define
training protocols where, from
the first gradient pass,
it's already compressed, or
we can train uncompressed.
And this approach actually
does some training,
then does the compression, which
includes lower bitwidths
for the parameters, and
other approaches also use
low bitwidths for the activations.
Then it actually does
some work on fine-
tuning training after that.
So yeah, that's a wide-
open space, I think, and
it's all very exciting.
>> So, do you think
that a compressed or
smaller neuro network is
somehow fundamentally
more likely to be vulnerable
to adversarial examples?
Or do you think
that it is really,
making the network smaller
is really independent of
the vulnerability to
adversarial examples?
Obviously this is incredibly
important because we're talking
about IOT examples, right?
>> Right, got it.
How does model size
interact with adversarial examples?
So for the adversarial stuff,
I guess to first
order, I don't know.
That would be a really
interesting study:
an adversarial
paper where you have
a variety of model sizes and
see if there's any correlation.
The adversarial stuff, I think,
is a great news story and
a really important
area of research, but
I think we should be really
clear on what it means, right?
It means, like you can
fool statistical models by
tricking them with
more statistics, right?
You can, in a different way,
fool humans the same way.
With cars,
there's a lot of freak-out about
that, but I would say,
well, you can also just
throw rocks at cars.
That's equally illegal.
But yeah, there's
the choice of model versus
the ability to be
fooled by that.
I guess a related question would
be how much of the wrong thing
the model memorizes, right?
And so I guess there have been
studies that show that small
models tend to be less
subject to over-fitting and
maybe require less
regularization as a result.
So if those two things
are correlated,
that might actually be a vote
in favor of small models.
>> So I have a question.
Rather than deciding up front
that, you know, I'm using a one-
by-one filter here and
a three-by-three filter there,
could you instead start with
three-by-three filters, but
for the three-by-three filters
you want to learn,
encourage them to be very sparse,
so you put some additional group-
sparsity constraints on them, and
then let the algorithm or
the data dictate what type of
filters you end up learning there?
>> Interesting question.
So, there's definitely
been a number of papers in
that area of when to enforce
the sparsity of the model,
whether it's from
the beginning or later on.
I would say one frustrating
thing about arbitrarily sparse
models: let's say you have
three-by-threes everywhere, or
ten-by-tens everywhere or
something, for
your filter size, and you delete
most of the parameters later.
Well, now you've got
basically a sparse matrix, or
sparse computation,
problem on your hands,
which is much harder to
implement efficiently.
So if you can kind of convince
the sparsity to be in a certain
shape like one by ones or
something, at the end
of the day often it's easier
to compute those efficiently.
Cuz sparse matrix
math tends to be
a lot harder to get
maximum efficiency on.
But like, I think it's something
that you should throw into
the mix, for sure.
>> But that is during training,
right?
Once you train the model, then
you know where your sparsity is,
and maybe when implementing
it on the device you can
take care of that; you can exploit
the sparsity in the implementation.
>> It's probably true.
Yeah.
>> Compilations there.
>> Yeah, a lot of options.
Yeah.
[LAUGH]
>> I had a question.
[INAUDIBLE]
>> What do you mean by
example by example?
>> Meaning easy versus hard,
[INAUDIBLE].
>> I see, so can we change
the model architecture
based on what data
we're looking at?
>> Yes.
>> Gosh, I mean, I think
there's a lot of options there.
In some sense like the ResNet
and stuff, and more recently
the DenseNet stuff where it
has a lot of bypasses and
things to have sort of variable
depth models sort of does that.
But not necessarily,
I don't think there's an if else
statement for which kind
of data you're looking at.
It's more that some samples
probably need fewer layers to be
fit, and so they might
wanna use only some of those layers.
But yeah, some way to
inspect the gradients, see that
this is being wasted here and
that's being wasted there, and
kind of interactively redesign
the model during training.
I think that would be
really useful in a fully or
semi-automated way, but
I don't think we're there yet.
>> Okay, so
let's thank the speaker again.
>> [APPLAUSE]
