SPEAKER: Please give a warm
welcome to Brett Koonce.
[APPLAUSE]
BRETT KOONCE: Thank
you all for coming.
Thank Google for having me.
Today, we're going to talk
about convolutional networks
with Swift and a
little bit of Python.
Very broadly, my goal is to
go from about as basic a level
of this subject--
convolutional networks
for image recognition--
to arrive at the current state
of the art in this field.
Towards this end, we'll
sort of do a quick overview
of neural networks
in general, and we'll
look at a 1D version of
the MNIST demo data set.
From there, we'll
introduce convolutions,
and we'll tackle the
MNIST problem again using
a 2D approach.
From there, we'll look at how
we can stack some convolutions
together, and we can tackle a
slightly larger problem called
CIFAR.
From there, we'll look at how we
can continue to build up these
blocks of convolutions
in order to produce VGG,
which was a state-of-the-art
approach in 2014.
Then we'll look at how we can
modify VGG slightly
to produce ResNet.
And then from there, we'll
have a very solid modern
state-of-the-art approach.
Then we'll do a quick demo
of building and training
that on a TPU.
After that, we'll look at a
very recent paper in this field,
which is called EfficientNet.
And then we'll do a demo
of running EfficientNet
on an Edge TPU device.
So here we go.
Computer vision is a very
large field, we'll say.
Four well-known areas are here.
To put them in the canonical
cat and dog example set,
we have image recognition, or
is this a cat or dog picture?
Object detection, or where
is the cat in this picture?
Image segmentation, or
which pixels are cat pixels?
And then finally,
instance segmentation--
how many sets of cat pixels
do we have in this picture?
Neural networks were kind of
a curiosity for many years.
A lot of traditional
machine learning
is centered around
reducing problems to one dimension
or trying to reduce
the search space.
Neural networks kind of
go the opposite direction.
They basically learn a function,
and their power is that they can
learn to approximate any function.
As a result, they're
really good at separating
high-dimensional data.
Images, although we don't
think of them this way,
to a computer, are a
five-dimensional problem.
You have three different
color channels plus
height and width dimensions.
And then you're mapping it to some
sort of category at the end.
In order to solve
these sort of problems,
we use a technique
called back-propagation,
and then we often end up having
to chain large quantities
of functions together.
So a applied to b, b applied
to c, c to d, d to e, and so on,
and so forth.
In order to do this, we
use a much beloved rule
from calculus called
the chain rule.
Everybody hates the chain rule.
And so somebody said,
well, heck, why don't we
have the computer keep
track of all this?
So auto-differentiation
is not really
a new idea in this field.
It actually comes
from the 1970s or so.
What's new and interesting,
what we're doing now today
is we're combining this
auto-differentiation
with the compiler
itself in order
to make it much easier
for the computer to reason
about what's happening.
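To make that concrete, here is a minimal sketch of what compiler-integrated auto-differentiation looks like in Swift for TensorFlow. The gradient(at:in:) spelling is from the Swift for TensorFlow toolchains of this era; the function being differentiated is just an illustrative example.

```swift
import TensorFlow

// f(x) = (x^2 + 1)^2 is a composition: g(x) = x^2 + 1 inside h(u) = u^2.
// The compiler applies the chain rule for us: f'(x) = 2(x^2 + 1) * 2x.
let df = gradient(at: Float(3)) { x -> Float in
    let u = x * x + 1   // inner function g
    return u * u        // outer function h
}
print(df)   // 120.0, since 4 * 3 * (9 + 1) = 120
```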
So Swift, the language,
I would argue,
isn't in itself
particularly special.
It's been out for
a little while.
Many iOS programmers
have adopted it.
And it's brought some functional
programming language concepts
to the older C crowd.
But the real power of
Swift, I would argue,
is that it's closely tied
to the LLVM compiler,
and so it can very easily
reach in and access
the capabilities of LLVM.
So here at the bottom, we have
sort of the basic building
blocks of neural networks.
We have the perceptron.
This is actually from 1958.
Neural networks are far
older than you might think.
We have feed-forward
networks, which added
a hidden layer--the first
major improvement.
And then what we're
going to do next
is build a deep feed-forward
neural network, conceptually
quite similar to this one.
And then we're going to add
some convolutions in order
to produce a deep convolutional
neural network on top.
So MNIST is a well-known
data set in this field.
It's a collection of hand-drawn
digits converted to grayscale.
The images are 28 pixels
across by 28 pixels down,
but we're not even going
to treat them as images,
we'll say.
So what we're going
to literally do
is take each row of
the MNIST data set
and convert it into an
extremely long vector, so that's
what this second picture over
here is trying to demonstrate.
Then we're literally going
to take this vector, which
is 28 by 28, or 784
elements long, and run it
through two 512
densely connected
neural network layers.
And then finally,
at the end, we're
going to map it to 10
different categories,
so the numbers 0 through 9.
So I originally set out to
write this code for you all.
But this gentleman
named Juan, another GDE,
had a very beautiful example
of this all together,
so I simply modified
his demo slightly.
So here's a 45-line demo of
how to solve the MNIST data set
using Swift for TensorFlow.
This is all of his code.
And what we have
right here is just
our very simple neural network.
We have our input layer of
784, two 512 layers and then
our output.
And because we've built all
of this with Swift for TensorFlow's
layer types, we can then run
everything through
to produce our final result.
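For reference, the heart of that model looks something like this. This is a minimal sketch in the spirit of Juan's demo rather than his exact code; the layer sizes match the slides, and the exact optimizer-update spelling varied a bit across Swift for TensorFlow toolchain versions.

```swift
import TensorFlow

struct MLP: Layer {
    // 784 inputs -> two 512-wide dense layers -> 10 output categories.
    var layer1 = Dense<Float>(inputSize: 784, outputSize: 512, activation: relu)
    var layer2 = Dense<Float>(inputSize: 512, outputSize: 512, activation: relu)
    var output = Dense<Float>(inputSize: 512, outputSize: 10)

    @differentiable
    func callAsFunction(_ input: Tensor<Float>) -> Tensor<Float> {
        return input.sequenced(through: layer1, layer2, output)
    }
}

var model = MLP()
let optimizer = SGD(for: model, learningRate: 0.1)

// One training step, given a batch of flattened images and integer labels.
func trainStep(images: Tensor<Float>, labels: Tensor<Int32>) {
    let grads = gradient(at: model) { model -> Tensor<Float> in
        return softmaxCrossEntropy(logits: model(images), labels: labels)
    }
    optimizer.update(&model, along: grads)
}
```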
So let's run this very simply.
OK, so we've made a
simple neural network
that achieves 93%
accuracy on the MNIST data set.
To be honest, we're
kind of cheating.
We're using larger,
denser layers
than you would normally use,
but we'll come back to that
here in a second.
Next, let's introduce
convolutions.
I would love to throw a single
slide up here and explain
to you convolutions
very simply, but it's
a pretty complicated subject.
I stole this slide
from an Nvidia deck
from a year or two
ago, but they're simply
explaining how you can use
convolutions to produce a blur.
So we have our
input image, and we
have over here our
blurred output image.
Our convolution
is just all ones,
and so each result
pixel is simply
the sum of the neighboring pixels
in the original image.
So we do it once here.
We step over a row of
pixels, do it again,
and repeat this process
over and over again
until we have a full
output image.
You don't really need to
understand the blurring part
per se, but you really need
to understand this striding
concept.
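If you want to see the striding mechanics spelled out, here is a toy sketch in plain Swift of that all-ones blur convolution sliding across a grayscale image. This is purely illustrative; real frameworks implement this with optimized tensor operations.

```swift
// Slide a 3x3 all-ones kernel across the image one pixel at a time (stride 1).
func boxBlur(_ image: [[Float]]) -> [[Float]] {
    let height = image.count
    let width = image[0].count
    var output: [[Float]] = []
    for y in 0...(height - 3) {        // slide the window down the image
        var row: [Float] = []
        for x in 0...(width - 3) {     // ...and across, one pixel per step
            var sum: Float = 0
            for ky in 0..<3 {
                for kx in 0..<3 {
                    sum += image[y + ky][x + kx]   // all kernel weights are 1
                }
            }
            row.append(sum)
        }
        output.append(row)
    }
    return output
}
```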
The next operation we
need to add is maxpool.
This is very simple to
understand, I think.
We have a collection
of 16 pixels.
We want to produce
four pixels out,
and we just look in each
collection of pixels
and take the largest one out.
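As a toy sketch, the same idea looks like this: for a 4-by-4 input, 16 pixels in, 4 pixels out, keeping the largest value in each 2-by-2 group. Again, this is only an illustration of the operation.

```swift
// 2x2 maxpool with stride 2.
func maxPool2x2(_ image: [[Float]]) -> [[Float]] {
    var output: [[Float]] = []
    for y in stride(from: 0, to: image.count, by: 2) {
        var row: [Float] = []
        for x in stride(from: 0, to: image[0].count, by: 2) {
            // Keep the largest of each 2x2 group of pixels.
            row.append(max(image[y][x], image[y][x + 1],
                           image[y + 1][x], image[y + 1][x + 1]))
        }
        output.append(row)
    }
    return output
}
```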
So let's revisit MNIST again,
but this time we'll use a
proper 2D convolutional approach
to solve our problem.
We're going to take our input,
which is now just 28 by 28--
same input-- run it through two
layers of 3-by-3 convolutions
and a maxpool.
And then we keep our same
two densely connected
512 layers and our final output.
So this code looks like this.
I've literally changed
nothing from the last demo.
This is all the same.
We've simply added
these two convolutions
and this maxpool operation.
And then finally we flatten
this output before it goes
into the dense layers.
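Sketched out in Swift for TensorFlow, that change looks roughly like this. The filter counts are illustrative, not the exact demo's; with valid padding, the 28-by-28 input shrinks to 12-by-12 after the two convolutions and the maxpool.

```swift
import TensorFlow

struct ConvMNIST: Layer {
    var conv1 = Conv2D<Float>(filterShape: (3, 3, 1, 32), activation: relu)
    var conv2 = Conv2D<Float>(filterShape: (3, 3, 32, 64), activation: relu)
    var pool = MaxPool2D<Float>(poolSize: (2, 2), strides: (2, 2))
    var flatten = Flatten<Float>()
    // Same dense head as before: two 512 layers and 10 output categories.
    var dense1 = Dense<Float>(inputSize: 12 * 12 * 64, outputSize: 512,
                              activation: relu)
    var dense2 = Dense<Float>(inputSize: 512, outputSize: 512, activation: relu)
    var output = Dense<Float>(inputSize: 512, outputSize: 10)

    @differentiable
    func callAsFunction(_ input: Tensor<Float>) -> Tensor<Float> {
        // The input is now a 28x28x1 image rather than a flat 784 vector.
        let convolved = input.sequenced(through: conv1, conv2, pool)
        return convolved.sequenced(through: flatten, dense1, dense2, output)
    }
}
```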
This one will take
about a minute to run,
and you'll end up with about 96% accuracy.
But I can assure you it runs.
So we've cut our error in half
by introducing convolutions
but not changing anything else.
Where do we go from here?
We've worked with
black-and-white data so far,
so now let's take on color.
We'll look at the CIFAR data
set, which is slightly larger--
32 pixels by 32 pixels.
But now we have three channels--
red, green, blue data.
But our same basic approach
that we've used before
can be used here as well.
So we have our 32 by 32
by now 3, our same two
layers of convolutions
and a maxpool, another two
layers of convolutions and
a maxpool, our same two
densely connected layers,
and finally our output layer.
For this demo, I modified one
of the demos in the Swift models
repository.
But conceptually, it's more
or less identical to the two
tricks we've looked at before.
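In sketch form, it's the MNIST network with one more convolution-maxpool block in front and a three-channel input. The channel counts here are illustrative, not the exact swift-models configuration.

```swift
import TensorFlow

struct CIFARNet: Layer {
    var conv1a = Conv2D<Float>(filterShape: (3, 3, 3, 32), padding: .same, activation: relu)
    var conv1b = Conv2D<Float>(filterShape: (3, 3, 32, 32), padding: .same, activation: relu)
    var pool1 = MaxPool2D<Float>(poolSize: (2, 2), strides: (2, 2))
    var conv2a = Conv2D<Float>(filterShape: (3, 3, 32, 64), padding: .same, activation: relu)
    var conv2b = Conv2D<Float>(filterShape: (3, 3, 64, 64), padding: .same, activation: relu)
    var pool2 = MaxPool2D<Float>(poolSize: (2, 2), strides: (2, 2))
    var flatten = Flatten<Float>()
    // With same padding, 32x32 halves twice: 32 -> 16 -> 8.
    var dense1 = Dense<Float>(inputSize: 8 * 8 * 64, outputSize: 512, activation: relu)
    var dense2 = Dense<Float>(inputSize: 512, outputSize: 512, activation: relu)
    var output = Dense<Float>(inputSize: 512, outputSize: 10)

    @differentiable
    func callAsFunction(_ input: Tensor<Float>) -> Tensor<Float> {
        let block1 = input.sequenced(through: conv1a, conv1b, pool1)
        let block2 = block1.sequenced(through: conv2a, conv2b, pool2)
        return block2.sequenced(through: flatten, dense1, dense2, output)
    }
}
```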
We'll run this.
It'll take a while
to run, and we'll
end up somewhere around
70% accuracy, which
isn't great for CIFAR.
But we've shown how we can
simply tackle the larger
problem by adding these blocks.
So we might take
this concept and sort
of parameterize things.
We might say, OK, we have
two layers of convolutions,
then two more--
what if we could somehow convert
this into a function?
We might even then say, well,
why not have two layers?
Why not have three or even four?
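That parameterized function might look something like this: a sketch of the block idea, not the actual swift-models implementation.

```swift
import TensorFlow

// A reusable block: two 3x3 convolutions followed by a maxpool,
// parameterized by the input and output channel counts.
struct ConvBlock: Layer {
    var conv1: Conv2D<Float>
    var conv2: Conv2D<Float>
    var pool = MaxPool2D<Float>(poolSize: (2, 2), strides: (2, 2))

    init(inChannels: Int, outChannels: Int) {
        conv1 = Conv2D<Float>(filterShape: (3, 3, inChannels, outChannels),
                              padding: .same, activation: relu)
        conv2 = Conv2D<Float>(filterShape: (3, 3, outChannels, outChannels),
                              padding: .same, activation: relu)
    }

    @differentiable
    func callAsFunction(_ input: Tensor<Float>) -> Tensor<Float> {
        return input.sequenced(through: conv1, conv2, pool)
    }
}
```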
So if we took this idea,
and jumped in a DeLorean,
and went back in
time five years,
we'd be able to possess
the state-of-the-art image
recognition network
in the world.
This is VGG 16, which was a top
performer in the ImageNet
competition that year.
But it's nothing
more complicated
than the tricks we've
looked at so far.
Now we're working with
the ImageNet data set,
so we have about a
million pictures,
1,000 different categories.
And our images are now
a little bit larger,
so we've gone to 224--
224 pixels by 224 pixels.
But everything else
remains roughly the same--
two layers of
convolutions, then a maxpool,
two layers of
convolutions, maxpool,
three layers of convolutions,
maxpool, three more, three more.
Our dense layers for this
one are a little bit wider,
so we go from 512 to 4,096.
And then finally for our output
layer, we use 1,000 categories.
So where do we go from there?
Residual networks are the
next important concept
to understand.
What we have over here on
the left side of these graphs
is VGG 19, which is the
cousin of the VGG 16
we just looked at.
The only difference is that
its blocks are 2, 2, 4, 4, 4
convolution layers deep, and
then out to the end.
Next to it, we have the
backbone of the ResNet 34
network, which is conceptually
more or less identical to what
we've looked at before.
We have an input layer,
then three blocks of paired
3-by-3 convolutions,
four more such blocks,
six more,
and finally three
more at the end.
The final crucial trick
that ResNet introduces
is that now we have this concept
of this residual layer, which
goes down the far side.
Basically, we connect
each layer of these blocks
together with an extra, we'll
say, sort of shortcut path.
Conceptually, neural
networks are really lazy.
And so if they
can get the answer
at a higher place in
the stack of blocks,
they'll sort of
shortcut it to the end.
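In code, the shortcut path is just an add. Here is a minimal sketch of a residual block; real ResNet blocks also include batch normalization and a projection when channel counts change, so this is just the bare idea.

```swift
import TensorFlow

struct ResidualBlock: Layer {
    var conv1: Conv2D<Float>
    var conv2: Conv2D<Float>

    init(channels: Int) {
        conv1 = Conv2D<Float>(filterShape: (3, 3, channels, channels),
                              padding: .same, activation: relu)
        conv2 = Conv2D<Float>(filterShape: (3, 3, channels, channels),
                              padding: .same)
    }

    @differentiable
    func callAsFunction(_ input: Tensor<Float>) -> Tensor<Float> {
        let residual = input.sequenced(through: conv1, conv2)
        return relu(input + residual)   // the shortcut: add the input back in
    }
}
```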
As a result, while we can
technically make VGG larger,
we start to hit the
limits of how much noise gets
introduced into the function,
whereas residual networks can
be scaled up much
further to solve
bigger and harder problems.
So this network
here is ResNet 34,
but we need to make one more
change in order to ResNet 50,
which is a very solid modern
state-of-the-art approach.
So we take our two
layers of 3-by-3 blocks
and replace them with this
1-by-1, 3-by-3, 1-by-1
convolutional stack
at each place.
So each pair of
blocks gets replaced
with this set of three.
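Sketched the same way, the bottleneck version looks like this, again omitting batch norm and the shortcut's projection for clarity.

```swift
import TensorFlow

// 1x1 down to a narrow channel count, 3x3 in the middle, 1x1 back up,
// with the last layer expanding the filter count by a factor of four.
struct BottleneckBlock: Layer {
    var reduce: Conv2D<Float>
    var conv: Conv2D<Float>
    var expand: Conv2D<Float>

    init(channels: Int) {
        reduce = Conv2D<Float>(filterShape: (1, 1, channels * 4, channels),
                               activation: relu)
        conv = Conv2D<Float>(filterShape: (3, 3, channels, channels),
                             padding: .same, activation: relu)
        expand = Conv2D<Float>(filterShape: (1, 1, channels, channels * 4))
    }

    @differentiable
    func callAsFunction(_ input: Tensor<Float>) -> Tensor<Float> {
        let bottlenecked = input.sequenced(through: reduce, conv, expand)
        return relu(input + bottlenecked)   // same shortcut path as before
    }
}
```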
So now let's train ResNet
50 on the ImageNet data
set using a TPU in the cloud.
Now, the first command
we use is ctpu up.
I ran this 20 minutes ago.
This is basically some
shell bookkeeping,
and then the rest is all
just pretty much standard.
There are some errors shown
because TF2 is throwing
a lot of warnings,
but you can ignore those for now.
So basically, if we leave this
running for about 12 hours,
we'll have produced ResNet 50
trained on the ImageNet data
set.
ResNets have kind of proven
to be unreasonably effective,
we'll say.
Although this
network's from 2015,
many other convolutional
neural networks
have tried to take its throne,
we'll say, so to speak.
But it's kind of held its
status as the strongest network
out there.
There's a couple
reasons for this.
But basically what
I would argue is
that this 1-by-1, 3-by-3,
1-by-1 convolutional block
layer is not really
more powerful
than our 3-by-3 layers.
The trick of this
bottleneck layer, we'll say,
is hidden in this
last 256 down here.
The last layer of
this bottleneck
increases the number of
filters by a factor of four.
We could technically do this
with the 3-by-3 network.
The problem is it's
really expensive.
So conceptually, I would
say that this 1-by-1, 3-by-3,
1-by-1 convolutional stack is
weaker, but it's also cheaper.
And because it's cheaper,
we can do more of it.
And because we
can do more of it,
then it ends up
being more powerful.
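You can see the economics with some quick arithmetic at a 256-filter stage; the numbers here are illustrative weight counts, ignoring biases.

```swift
// One plain 3x3 convolution at 256 filters in and out:
let plain = 3 * 3 * 256 * 256                 // ~590K weights

// The bottleneck: 1x1 down to 64, 3x3 at 64, 1x1 back up to 256:
let bottleneck = 1 * 1 * 256 * 64 +           // ~16K
                 3 * 3 * 64 * 64 +            // ~37K
                 1 * 1 * 64 * 256             // ~16K

print(plain, bottleneck)   // 589824 vs 69632 -- roughly 8x cheaper
```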
So in order to replace residual
networks, what we don't need
is something that's
technically better.
People have built better
networks, we'll say,
by making them wider,
deeper, and larger.
But what we really
need is something
that's more efficient.
So this paper came out
in May of this year,
and it's the culmination of
several years of research
by the Google team.
What they've done
basically is parameterized
all these possible
values, and then they've
turned a reinforcement
learning algorithm
loose to search through
the space of possible
networks.
The end result is
this EfficientNet,
which is the new state
of the art in this field.
Like I said, I
would argue this is
the result of a lot of effort.
The original NASNet
paper came out,
and the computer
sort of produced
many weird-looking
architectures.
And people didn't really--
they were just like, oh, you've
thrown a bunch of CPU power
or TPU power at this problem.
There's the PNAS
paper, where they
tried to improve the
search algorithm slightly,
and then the AmoebaNet
paper from last year, which
also produced
convolutional neural network
layers that only a
computer could love.
So what's cool to me about
this paper is effectively
they've used the
computer to sort of find
this whole group of networks,
but then they've gone in on top
and been able to apply
mathematical heuristics
in order to scale
these things up
in a reliable, repeatable,
reproducible way.
So I think, honestly,
only Google could
have done this.
They're really the only ones
with the hardware--
we'll say the compute capacity--
the software, and the
ability to run these sorts
of experiments at scale.
And then finally, they
have the research team
that's actually actively
pursuing these techniques.
This came out at the
start of last month.
We can think of the objective
for our search algorithm as
being accuracy, perhaps.
But we can also put different
parameters in there.
So one parameter that
we could potentially
put in there is just
the number of operations
that we're using to
produce our results.
So Google is now making
these Edge TPU devices,
which are small, little-bitty
TPUs that you can buy for $75.
They put this architecture
into the EfficientNet search
strategy, and it was able to
produce this particular set
of networks, which run very fast
and whatnot on these Edge TPU
devices.
So if you look at our
ResNet 50 from before,
you can see how this
EfficientNet compares.
This Edge TPU-S variant
is kind of the holy grail.
It's smaller, it's
faster, and it's better.
So I would highly suggest
that you check out this stuff.
So now let's run EfficientNet
on an Edge TPU device.
So here.
All this code's on
the internet, but I'll
step you through it here.
First, we'll run
this on a TPU 3.
The only fun trick is that this
is all pretty bleeding edge,
so you need to have the
TensorFlow nightlies enabled.
So I did that 20 minutes ago.
This one's fine on
a TPU 2 as well.
We have some more
bookkeeping here like before.
We can just copy paste
all this and run it.
So now it's all running
happily in the cloud.
This will take about
30 hours to complete,
and it will produce
a checkpoint for us.
Next, the Edge TPU
devices use int8 math.
But what this training run
will produce
is a floating-point
version, so we
need to modify our network
by converting it to int8.
So we copy the
image file locally.
And then we can just run
this quantization script
in order to produce
our quantized network.
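Conceptually, what quantization does can be sketched in a few lines. This is a toy illustration of int8 quantization, not what the actual script does; real TensorFlow quantization is considerably more involved.

```swift
// Map floating-point weights onto integer buckets with a scale factor.
func quantize(_ weights: [Float]) -> (values: [Int8], scale: Float) {
    let maxMagnitude = max(weights.map { abs($0) }.max() ?? 1, 1e-8)
    let scale = maxMagnitude / 127
    // Each weight is stored as round(w / scale), an integer in -127...127.
    let values = weights.map { Int8(($0 / scale).rounded()) }
    return (values, scale)
}

let (q, s) = quantize([0.5, -1.2, 0.03])
print(q, s)   // recover an approximate weight with Float(q[i]) * s
```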
The only fun part of
getting this working was
that this script relies
on the XLA operations
in TensorFlow, which aren't
part of any of the default
builds, so you have to
install it from source,
or build it from scratch.
And so then, finally,
we'll go to Wikipedia
and download the panda
picture, and we'll
run the completely standard
image classification demo
that they have in
the Edge TPU demos.
The only other
modification I made
is I've provided a list of
ImageNet labels to work with.
OK, so our network
thinks that it's
looking at a panda picture with
approximately 60% probability,
but it might also be a fox.
It's not perfectly sure.
OK, to recap, our
goal is to cover
convolutional neural networks
for image recognition.
Towards that end, we built a
one-dimensional neural network
and applied it to a
very simple data set.
From there, we added
convolutions and built up
a 2D neural network.
From there, we added
more and more layers
to solve larger and
larger problems.
Then we added residual blocks
and looked at how we can modify
our block types in
order to finally arrive
at state-of-the-art
approaches in this field.
This is about all I have for
convolutional neural networks
per se.
But many of the interesting
applications of CNNs
are in completely
non-image-related fields,
we'll say.
You can add another dimension
and begin to deal with depth,
the z dimension.
QANet was an
interesting paper where
they converted language
models into a CNN
residual-style approach.
AstroNet is an
interesting use of CNNs
to try and find planets.
The recent AlphaFold paper--
and MultiCom is another
version with a similar idea--
uses 1D, 2D, and 3D
networks together
in order to model protein data.
Originally, I tried to put
a little piece of each
of these papers up here,
but the slide got
a little bit busy,
so I reduced it down.
So what you're
looking at right here
is one of the residual
blocks of AlphaZero.
The AlphaZero engine is 40
of these blocks altogether.
But if you look at this block,
it's literally a residual layer
just like we were looking
at before and in two
3-by-3 convolutions.
So the same basic
approach that we
used on the MNIST data set
can, in a different context, be
used to solve the game of Go.
So that's all I've
got, and I'd like
to thank you all for coming.
[APPLAUSE]
