[MUSIC PLAYING]
MARTIN GORNER: Today,
machine-learning models
are fairly mainstream.
They are shipping in large
consumer-facing projects,
also in enterprise
projects, and that's
what we want to tell
you about today.
For that, I would
like to welcome
Nik Spirin from Gigster.
NIKITA SPIRIN:
Thank you, Martin.
And I also want
to thank everyone
for coming to this session.
This is a repeat of
our session on Tuesday.
Just, really, thank
you a lot for coming.
And a few words about Gigster--
Gigster is a
consulting company that
delivers digital
transformation and business
impact to enterprises
at startup speed.
We focus on new technology, AI and machine learning in particular.
I lead the AI/ML projects.
And today, we will talk
about one particular project
that we did for the leading
consumer electronics company.
And since this is a
technical conference,
I will show you this slide.
We put a tiger in a box.
Seriously, we built
over 20 different models
for this client, including image
tagging, composition analysis,
style analysis, color
analysis, and many others.
And for this talk, we decided
to pick object detection
as our model, since
it's an exciting model
from the research
point of view, and also
because it opened a lot
of different applications
for the customers.
MARTIN GORNER: And,
Nik, this is a product
that is actually
shipping today, a product
for professional photographers.
NIKITA SPIRIN: Correct.
This is already
actually running.
MARTIN GORNER: Nice.
So tell us more.
Why object detection?
Why specifically this?
NIKITA SPIRIN: With object detection, as you can see on the slide, by having bounding boxes we can select images that contain a specific object in a particular location, which allows for overlaying text, for example, for media applications or advertisements.
Also, surprisingly
for me and the team,
we found that photographers do
care about images like that.
And they call it animal
portrait, in the same way
as we speak about
human portraits.
And they do differentiate
those images
from animal pictures
at a distance
and actively search for
such images on the platform
that we also built
for the customer.
And finally, object detection
is a type of model that draws
bounding boxes around objects.
And because of that, it
allows object counting
in queries like "two cute
pandas," while image tagging,
or image classification, which
is the common application also
in the computer-vision
space, only supports assigning binary labels for the presence of a particular object in an image.
MARTIN GORNER: And
for this use case,
you have, from your
customer, a data
set of over 300,000
images with, I think,
something like 600,000
bounding boxes.
And what we are going
to tell you today
is how we took one
model, RetinaNet--
how we trained it
on TPUs to deliver
this end-user experience.
But I would like to do
something a little bit more.
Let's train this
model here, on stage.
NIKITA SPIRIN: Let's do it.
And we will truly show the
power of TPUs for fast training.
MARTIN GORNER: So
I will explain,
in a little bit
more depth, what I
am doing here just in a second.
That's what the talk is about.
Here, I just want to
launch the script that
will start this training job.
So I'm in this one.
Sorry.
Just a second.
Here I am.
Let's call it-- let's
call the model ONSTAGE.
OK.
And I'm launching it.
So it's doing stuff--
doing stuff, and I should
have a job launched right here
in a second.
The job is launched.
I'm going to check in my AI
Platform, Jobs interface,
that I have a job running.
It has been running
for four seconds.
Let's now dive deeper
into what these TPUs are--
Tensor Processing Units.
So this is a TPU v2.
This is a TPU v3.
Those are fairly large
boards, large like this.
You have four
chips on one board,
and each chip is
a dual-core TPU.
And by the way,
on the floor, you
have a whole rack
of those TPUs--
looks like this.
And in Google Cloud,
you can provision a TPU
either by board, eight cores.
Or you can actually get
multiple of those boards
together, in what we call a pod.
And they are tied together on a high-performance interconnect that makes them look, to your application, like just one big accelerator.
So the only work
you have to do is
to increase your
training batch size
to take advantage of all
this additional hardware.
Nik, between v2 and v3,
which one do you like best?
NIKITA SPIRIN: Short answer, v3, and not just because v3 is the latest version of TPU, but because, based on our experiments combining TPU v3 with RetinaNet, which we will cover later-- it's a model for object detection-- TPU v3 was 3.5 times faster compared to TPU v2. And at the same time, TPU v3 is only twice as expensive. Combining those two things-- 3.5 divided by 2 is about 1.75-- the computation per dollar for TPU v3, for that particular training job, is better. So our recommendation: if you do object detection, use RetinaNet, use TPU v3.
MARTIN GORNER: Let's
have a look at one TPU.
What is inside?
I told you-- one board,
four chips, dual-core.
What is in each of those cores?
So you have a fairly traditional
vector-processing unit.
But the specific part that makes
this special for neural network
workloads is the
matrix multiply unit.
It's a dedicated hardware
unit that performs 128
by 128 matrix multiplications.
And it does so, for speed, in reduced 16-bit floating-point precision.
So this is a very traditional
trick in neural networks
to work in reduced precision.
Neural network training is
fairly resistant to the loss
of precision.
There are even cases where
the loss of precision,
by introducing noise,
actually helps the training.
Sometimes you even
introduce noise on purpose,
as a way of helping
the training.
But the traditional-- I mean,
the standard float16 format,
as you can see here,
has a different number
of exponent bits compared to
float32, which means that it
doesn't have the same range.
And so when you push your
model to reduce precision
using float16, you usually
get underflows and overflows.
So all of these problems are
fixable if you know your code,
but it's work.
And in TPUs, we decided to
use a slightly different
reduced precision format,
which we called bfloat16.
And you see, it's
actually float32 with just
the fractional bits cut off.
It has exactly the same number
of exponent bits, exactly
the same range, which
means that it's usually
a drop-in replacement
for your float32s.
And there is nothing to
be done on your model
when you use that.
You actually don't have to do
the work of using bfloat16.
This is something that
is done automatically.
You send the data
in float32, and it
gets downgraded in
the MXU on the input.
And then, the product of two 16-bit floating-point numbers becomes, again, a float32, which is what comes out as the result.
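To see that range difference concretely, here is a tiny check you can run yourself; this is just an illustrative sketch, assuming eager execution is enabled (it is the default in TensorFlow 2.0).

```python
import tensorflow as tf
tf.enable_eager_execution()  # not needed in TensorFlow 2.0

# float16 has fewer exponent bits, so a large float32 value overflows when cast;
# bfloat16 keeps float32's exponent range, so the same value survives the cast.
x = tf.constant(1e20)
print(tf.cast(x, tf.float16))   # inf   (float16 tops out around 65504)
print(tf.cast(x, tf.bfloat16))  # ~1e20 (same range as float32, fewer fraction bits)
```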
And now, how is this
unit built, the MXU?
So it uses quite a rare architecture in microprocessors, called a systolic array.
Try to remember how to perform
a matrix multiplication.
So in matrix multiply,
each point of the result
is a dot product of
one line of a matrix
and one column of
the second matrix.
And a dot product is a series of
multiply-accumulate operations.
Multiply, plus, multiply, plus,
multiply, plus, multiply, plus.
So the only compute element
we have in this whole array
are multiply-accumulate
elements.
And I think you will have to
spend a little bit more time
offline with this
animation, but what
it does is that you load
one of the arrays into--
one of your matrices
into this array,
and you flow the numbers
from your second matrix
through this array.
And if you-- I mean,
you'll have to believe me.
As the data flows through it,
and as the multiply accumulates
are performed, and the
intermediate result gets passed
to the right, out of
this side will come out
all the dot products that
compose your resulting matrix.
The nice part about
this architecture
is, well, first of all, all
the intermediate results
are just flowing on the wire--
so nothing to store
back to registers,
or to intermediate
memory, or anything else.
Everything is
self-contained in here.
And the second thing is that
the individual compute element
is a tiny, tiny 16-bit
by 16-bit multiply
and accumulate unit, which means
that we can cram lots of them
in one chip.
And this 128-by-128 array contains about 16,000 of those compute units-- 32,000 of them in total.
Density means power,
and power means cost.
That's the advantage of TPUs.
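To make the multiply-accumulate view concrete, here is a minimal Python sketch (plain Python, not TPU code) showing that every element of a matrix product is exactly such a chain of multiply-plus steps:

```python
def matmul_mac(a, b):
    """Matrix multiply written as chains of multiply-accumulate (MAC) steps,
    the only operation each cell of the MXU's systolic array performs."""
    n, k, m = len(a), len(a[0]), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for t in range(k):
                acc += a[i][t] * b[t][j]   # multiply, plus, multiply, plus...
            out[i][j] = acc                # each output element is one dot product
    return out

print(matmul_mac([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19.0, 22.0], [43.0, 50.0]]
```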
And, of course, you want
to know how fast this goes.
So some of you might
remember my talk
from last year,
when I showed you
how to do airplane spotting.
I'm using this neural network
as a benchmark these days.
And so this neural network
trains in four and a half
hours, on a big powerful GPU.
On ML Engine, now AI Platform, you can easily get a cluster of those.
And with five powerful GPUs,
I can train this in one hour.
And I've chosen this example
because this is exactly
the time it takes to train
it on one Cloud TPU v2.
So just as a rule of thumb, one Cloud TPU, four chips, is roughly equivalent to five powerful GPUs in terms of speed. In terms of cost, it's almost three times less expensive, because we have a density advantage in the microarchitecture.
So now we have
powerful hardware.
We also need a model--
an object detection model.
We have a whole
library of models,
TPU-optimized models, in the github.com/tensorflow/tpu repository.
Nik, which one did you choose?
NIKITA SPIRIN: We picked
RetinaNet, obviously since it's
an object-detection model.
And we also picked
RetinaNet not only
because it is the
model available
in the official TensorFlow repo for object detection,
but because it is
considered to be
one of the best models for
object detection in general.
And let's talk about it.
MARTIN GORNER: So maybe you
can tell us how it works.
Why is it the best?
NIKITA SPIRIN: Yeah.
First, before we go deep
into the RetinaNet details,
I will introduce the
overall pipeline that
is used for object
detection, and then we
will cover a special,
new idea that
was introduced in the
official RetinaNet paper.
So overall, object
detection works
as a pipeline where
we first generate
a lot of bounding boxes.
And you see them on the slide.
Those are also called proposals.
And then, we do prediction of
the class labels and adjustment
of the positions of the bounding
boxes using the neural net.
And there are one-stage neural networks and two-stage neural networks. One-stage networks, such as OverFeat, SSD, and YOLO, combine the candidate generation, bounding box generation, and detection stages together.
They process images only once,
and that's why they are fast.
At the same time, they
are not as accurate.
And there are two-stage detectors, which separate those stages. They first generate candidate proposals, then do sampling and post-processing of the bounding boxes, and only then predict the class labels and adjust the coordinates.
So RetinaNet was designed to
address those shortcomings
and to be both
fast and accurate.
MARTIN GORNER: And it's
a one-stage detector.
NIKITA SPIRIN: It's
a one-stage detector,
and it uses ResNet as a backbone. And it also relies on the pyramid structure of the neural network-- the fact that it has pooling layers inside that gradually shrink the size of the image.
And it is helpful because, when
we want to do object detection,
we want to be able to recognize
objects at different sizes,
at different resolutions.
And by having the natural
structure of the ResNet
pyramid, we generate multiple
layers of representation.
And this is called
Feature Pyramid Network,
the idea that was
also introduced
in the paper on
feature pyramids.
And we use that multi-resolution representation, the combination of those feature maps, as our image representation.
And then, we extend it with two subnetworks. One subnetwork is responsible for classification-- basically assigning the class label. And the second is responsible for regression, where we use a special type of parametrization called anchors, which was introduced in the "Faster R-CNN" paper.
So I will not go
into the details,
but the idea here is that
we have bounding boxes,
and we adjust the coordinates
of the bounding boxes using
the neural net.
And that's the
architecture that we use.
And obviously, then, we
launched the training process.
MARTIN GORNER: And so
this architecture actually
existed before RetinaNet,
but you told me
it has one problem-- class imbalance. What's that?
NIKITA SPIRIN: Let's talk
about it a little bit.
Object detection-- as
you can see on the slide,
there are only two pandas.
And those are true positions
of the bounding boxes
around them, in red.
And there are a
lot of background
bounding boxes in blue.
And those are negative
examples for machine learning.
So when you train a neural net, in the loss function, if you sum over all potential positions of bounding boxes, the easy cases-- the cases coming from the background-- will saturate the loss function, so that it will not be sensitive to the true objective that we are trying to learn from the data set-- how to predict the position and class label of the real bounding boxes.
MARTIN GORNER: And when
you say they are easy,
you mean that the network
makes a small error on them.
But a small error multiplied by hundreds of background boxes becomes a big error, and the pandas get lost in that.
And RetinaNet introduced
a very clever idea
to overcome that,
called a focal loss.
NIKITA SPIRIN: Yeah, and focal
loss, just for you to remember,
comes from the word
"focus" because it actually
focuses on hard
examples, so examples
which are difficult to
classify for the neural net.
And in terms of math, it extends the traditional cross-entropy loss by adding a multiplicative factor, (1 - p) to the power of gamma. And the way it works, essentially: if the predicted probability is high, which means it is an easy example, then the multiplicative factor will be very, very small, especially after we exponentiate it. And the contribution of that easy example, or background bounding box, to the loss function will be very, very minimal.
And if it's the
opposite, then it
will be kept in
the loss function.
And because of
that, the neural net
will be really able to learn
the important signal of how
to focus on the hard examples.
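As a rough sketch in TensorFlow (not the exact code from the tensorflow/tpu RetinaNet implementation), a binary focal loss can be written like this; gamma and the optional alpha class weighting follow the paper:

```python
import tensorflow as tf

def focal_loss(y_true, y_pred, gamma=2.0, alpha=0.25):
    """Cross-entropy scaled by (1 - p_t)^gamma, so well-classified background
    boxes contribute almost nothing and hard examples dominate the loss."""
    ce = tf.keras.backend.binary_crossentropy(y_true, y_pred)    # standard cross-entropy
    p_t = y_true * y_pred + (1.0 - y_true) * (1.0 - y_pred)      # probability of the true class
    alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)    # optional class weighting
    return alpha_t * tf.pow(1.0 - p_t, gamma) * ce               # down-weight the easy examples
```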
MARTIN GORNER: So
that's a clever idea,
but this actually works.
In this picture, you
have various variants
of RetinaNet in color.
And the letters are various
competing architectures.
And this is an accuracy-versus-speed diagram.
You see that I put a red line in
there for better understanding.
RetinaNet is above everything.
NIKITA SPIRIN: Yeah, and as we
say in the research community,
it's in the upper
envelope of the graph.
MARTIN GORNER: We
have good hardware.
We have a great model.
What is missing, now, is a good working environment for data scientists to put this model through its paces.
And for that, I would
like to show you
a new feature in AI Platform,
which is called Notebooks.
So, AI Platform used
to be called ML Engine.
As we are adding features, we renamed it AI Platform.
If you're used to Jobs
and deployed Models,
the new feature
here is Notebooks.
In Notebooks, you can
create a new instance
that has everything installed
in either TensorFlow or PyTorch.
And you can select
that you want it--
you want just a standard
instance with CPUs,
or you can get a GPU
with it, or even a TPU.
I'll show you how in a second.
What you get is an instance
and a link, a fast link,
to open Jupyter directly
on your instance.
Actually, I've done that.
That's why it's complaining.
So here I am in my familiar
Jupyter environment.
And actually, in
this environment,
this machine is
actually TPU powered.
Since you have not seen, in the UI, how to create a Jupyter machine with a TPU, let me give you the cheat code.
You can do so on
the command line.
So I have one command
line to launch my VM,
and the second one
to create a TPU,
and a little bit of glue
here to connect the two.
It will soon be in the UI.
In the meantime, this is
how you can get there.
So let's do something
in this environment.
Let's have a look at some data.
First of all, I want to show you how to load data, because TPUs are very fast, which means it happens very quickly that you get data starved, and you have a limit on how fast your training goes, simply because the data is not getting there fast enough.
So in here, in this
notebook, I have
a first cell that loads the
annotations of this data set.
300,000 images-- those are
the annotations of the boxes.
You see it's a
wildlife-specific data set.
Here are all the categories.
It can recognize mice, hamsters, snow leopards, and so on.
And what I like
about Notebooks is
that I can visualize
my statistics
and eyeball my data sets.
So for example,
I see that I have
a couple of fairly large
images in this data set.
Maybe I will have to
pay attention to that.
But now, I want
to see my images.
So for that, I'm going to
use the tf.data.Dataset API.
That's an API that is
designed to prepare
data sets for ingestion
by a neural network.
I'm starting with,
here, list_files.
And from a bucket
on GCS, I'm getting
a data set of file names.
Then, in this API, you can map
any function on your data set.
So I'm mapping a decoding
function, which is here.
It just basically loads the
image and calls decode_jpeg.
And now, from my data
set of file names,
I have a data set
of images in memory.
I will also need to map
and resize my images.
And then I have all the tools
I need to work on my data set,
so I can repeat it
across multiple epochs.
I can shuffle it.
I can batch my data
set for training.
And a nice one is prefetch.
With prefetch, you
are telling the system
to work on the next
batch while you are
training on the current batch.
So you can do all the
preprocessing in parallel,
with training with prefetch.
And in TensorFlow,
the way you can
iterate on a data set
to see what is inside
is simply by creating a loop.
This works in either
TensorFlow 2.0,
or if you activate eager
mode in TensorFlow.
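Put together, the pipeline described above looks roughly like this; it is only a sketch, with a placeholder bucket path and image size, and tf.image.resize_images is the TF 1.x name (tf.image.resize in TensorFlow 2.0):

```python
import tensorflow as tf

def decode(filename):
    # Load one JPEG from GCS and resize it.
    image = tf.image.decode_jpeg(tf.io.read_file(filename), channels=3)
    return tf.image.resize_images(image, [384, 384])

dataset = (tf.data.Dataset.list_files("gs://my-bucket/images/*.jpg")  # data set of file names
           .map(decode)      # data set of images
           .repeat()         # multiple epochs
           .shuffle(1024)
           .batch(16)
           .prefetch(1))     # prepare the next batch while training on the current one

for images in dataset:       # works with eager execution or in TensorFlow 2.0
    print(images.shape)      # (16, 384, 384, 3)
    break
```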
And here I am.
Here are my image
tensors coming out.
So, 16 by 16-- OK. Batches of 16 images, 384 by 384 pixels.
But you see this
is painfully slow,
and they're coming
like one every second--
super slow.
Why is that?
We have 300,000 images,
and they are on GCS.
GCS can get you
great throughput,
but GCS, as any storage
that is attached
through IP, the internet, will
have a penalty for every file
you access.
So getting 300,000 files one by
one is not a good idea at all--
won't work.
And, yeah, I'm visualizing them
so you see what kind of images
I have.
The solution is to batch
your files into bigger files.
And for that, we use a file format called TFRecord.
We use it not because it
has super special features,
but because it can put multiple
pieces of data in one file,
and it has good support
in the data set library.
So let me show
you how to use it.
Nik and his team--
they've done the work
of putting those
files into TFRecords.
So now, I'm reading
here from a bucket
which has TFRecord files.
And it starts, again,
with list_files.
I have a list of file names.
Then, from that, I create a TFRecordDataset. Now, I have a data set of records. Actually, this line is commented out because I advise you to use the interleave command instead to apply TFRecordDataset-- because, with it, you are reading from 16 files at once.
And that is how you get a
good throughput from GCS.
You put your data into
a reasonable number
of reasonably large files,
like 100 megabytes large,
and then you read
from multiple files
at once using the data set API.
After that, I'm mapping
a decoding function,
which is here.
And here, basically, I'm reading, from every record, all the stuff that Nik has put in there. So that has the image, some metadata, all my bounding boxes, and the classes of those bounding boxes.
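A decoding function of that shape typically looks like the sketch below; note that the feature keys here are hypothetical stand-ins, not necessarily the ones this data set actually uses:

```python
import tensorflow as tf

def parse_record(serialized):
    # Hypothetical feature keys -- adjust them to match how the TFRecords were written.
    features = {
        "image/encoded": tf.io.FixedLenFeature([], tf.string),
        "image/object/bbox": tf.io.VarLenFeature(tf.float32),
        "image/object/class": tf.io.VarLenFeature(tf.int64),
    }
    parsed = tf.io.parse_single_example(serialized, features)
    image = tf.image.decode_jpeg(parsed["image/encoded"], channels=3)
    boxes = tf.reshape(tf.sparse.to_dense(parsed["image/object/bbox"]), [-1, 4])  # one row per box
    classes = tf.sparse.to_dense(parsed["image/object/class"])
    return image, boxes, classes
```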
And so that's basically it.
So now, again, I can
visualize my images.
And let me relaunch,
here, a loop that
will pull images, 16 by 16.
So here is my training data set.
You see it has pictures
and bounding boxes--
dolphins, ladybugs, and so on.
And this time, as you can see, my batches of 16 images are coming out almost instantly. This is the speed you want.
NIKITA SPIRIN: And just
pause here for a second.
This is one design pattern,
or program pattern,
that we want you to remember,
to have a reasonable number
of reasonably large files.
A reasonable number of files is about 10 to 100, and in size, also, 10 to 100 megabytes each.
In this case, you will be
able to really maximize
the throughput and benefit
from tf.data.Dataset parallel
loading.
One additional tip
that we want to share
is about the bounding boxes and
the nuances of the RetinaNet
implementation, especially
for the data scientists
in the room.
For managers, the takeaway is just that this is hard and data scientists should be valued. There was a caveat in the implementation of RetinaNet from the official repo: there is a limit on the number of bounding boxes per image, set to 100, hidden somewhere in the code. We started out thinking it would just work out of the box, but it was stopping silently, and training was failing. And the practical suggestion for you is to filter images with too many bounding boxes, if that's not critical for your application. So you have an image: either you skip it, or you reduce the number of bounding boxes per image so that this pipeline can be used.
Obviously, if you
really want to build
a detector that is capturing
a lot of different objects--
for example, it could
be people tracking
for security
applications-- then you
have to work around that.
But a short fix is to know, at least, that there is a hard-coded limit of 100 bounding boxes in the implementation.
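One possible workaround, sketched with the tf.data API and reusing the hypothetical parse_record function from above, is to drop any image that carries more than 100 ground-truth boxes:

```python
import tensorflow as tf

MAX_BOXES = 100  # the hard limit in the RetinaNet input pipeline mentioned above

filenames = tf.io.gfile.glob("gs://my-bucket/train-*.tfrecord")   # placeholder path
dataset = (tf.data.TFRecordDataset(filenames)
           .map(parse_record)                                     # (image, boxes, classes)
           .filter(lambda image, boxes, classes:
                   tf.shape(boxes)[0] <= MAX_BOXES))              # skip over-crowded images
```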
MARTIN GORNER: I
also want to show you
how to use a TPU with
your own Keras model.
So for that, let me use
the exact same data set.
But I have just taken the
first class for every image,
and I will build a very
simple classifier just
to show you how to build
a Keras model using a TPU.
So same data set-- you recognize
the data set code here.
And the second
piece of the code is
here, a convolutional
classifier in Keras.
So it uses the
tf.keras.Sequential model.
And it's a sequence of
convolutional layers,
ending with a certain number
of classes, softmax activation.
So it's a classifier.
Here is what Keras tells
me this model is made of.
And the three lines of code
that I would like to show you
are here.
First, you have one
function to find your TPU.
Then, there is a second function
to define a TPU distribution
strategy using that TPU--
why distribution?
One TPU is eight cores.
Eight.
So it's distributed training--
and a third function that
transforms your Keras model
into a model that can
be trained on TPU.
So those three lines allow
you to use TPUs in Keras.
And if I run them, you
will see here, in the logs
that my eight TPU cores that
are showing up, this notebook
is actually powered by TPU v3.
And then I can
launch the training.
Well, there will be
nothing interesting here.
Normal Keras
training will start.
It takes just a couple
of seconds to start.
I think that we can
move on because we
have other interesting
things to tell you.
Yeah.
Yeah, you see, it's starting.
So just to recap, this is how
you use TPUs in a Keras model.
Three simple lines--
find your TPU,
define a distribution
strategy that uses that TPU,
and then keras_to_tpu_model to
get a model that you can train.
And when you train it,
it will run on TPU.
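In code, those three lines look roughly like this, using the TF 1.x contrib APIs that were current at the time (in TensorFlow 2.x this has moved to tf.distribute.TPUStrategy); here, model stands for the Keras Sequential classifier from above, training_dataset for the tf.data pipeline built earlier, and the TPU_NAME environment variable is assumed to be set:

```python
import os
import tensorflow as tf

tpu = tf.contrib.cluster_resolver.TPUClusterResolver(
    tpu=os.environ["TPU_NAME"])                                    # 1. find your TPU
strategy = tf.contrib.tpu.TPUDistributionStrategy(tpu)             # 2. TPU distribution strategy
tpu_model = tf.contrib.tpu.keras_to_tpu_model(model,               # 3. convert the Keras model
                                              strategy=strategy)

tpu_model.fit(training_dataset, steps_per_epoch=100, epochs=5)     # training now runs on the 8 TPU cores
```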
A quick word on what is going on-- so this is your VM, your TensorFlow Python code. That's what it's running.
You remember that
Python and Keras--
they represent your neural
network as a computation graph,
in memory.
And it's that computation
graph that is sent to the TPU.
Your TPU never sees
your Python code.
That's important
to bear in mind.
The TPUs, they don't
execute Python code.
They get the computation graph.
They use the Accelerated
Linear Algebra compiler, XLA,
to transform that computation
graph into TPU byte code.
And if you're using
the data set API,
then your data input
pipeline is also
a piece of TensorFlow
graph, which
will be shipped to the TPU.
And during training, the TPU
will be pulling data directly
from GCS.
So that's quite efficient.
And then, finally,
this is the code
for reading from TFRecord files.
Please, if you don't care
about ordering of your data,
which you never do-- you
usually shuffle your data--
please don't forget to put
experimental_deterministic
= False.
Because this tells
the data set API
that it can do whatever it wants
with the data to get it fast.
And then, you list your files.
From a data set of file names,
you get a data set of records.
You map a decoding
function to get a data
set of whatever you care about.
And then you can shuffle, batch,
repeat, prefetch your data.
And also please use
the interleave function
to apply your TFRecordDataset
transformation,
because that's how
you can load your data
from 16 files at once.
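As a sketch, that recap translates to roughly the following; the bucket path is a placeholder, parse_record is the decoding function sketched earlier, and the option name matches the TF 1.x API mentioned above:

```python
import tensorflow as tf

options = tf.data.Options()
options.experimental_deterministic = False     # ordering doesn't matter, so let tf.data go fast

dataset = (tf.data.Dataset.list_files("gs://my-bucket/train-*.tfrecord")
           .interleave(tf.data.TFRecordDataset, cycle_length=16)   # read from 16 files at once
           .with_options(options)
           .map(parse_record)                                      # data set of (image, boxes, classes)
           .shuffle(1024)
           .batch(16)
           .repeat()
           .prefetch(1))
```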
So Nik, how do you like
having a powerful TPU
v3 at your fingertips,
in a notebook?
NIKITA SPIRIN: I think having
a powerful accelerator is
definitely amazing.
At the same time, we don't want
to pay for powerful machines
just to type code.
And we also don't want to run
around and chase machines that
are idle and don't do any
useful work because we
have to pay for it.
MARTIN GORNER: And how did you
solve that problem at Gigster?
NIKITA SPIRIN: Yeah,
when we started,
we were ahead of
the game, and we
had to invent some
solutions from scratch.
And we built the so-called GDE, which is the Gigster Development Environment, and a set
of processes around it,
to make sure that our data
scientists on the team
are productive, and also that
we minimize the infrastructure
cost.
The way it works--
we have a few machines
allocated to the team
for quick interactive
experiments.
And we also have
the Dockerized jobs
that we start for training real
production workflow models.
So we have a lot of
Terraform scripts,
Ansible scripts that
provision the machines,
install necessary environment
variables, drivers, and so on,
and then take a Docker container, run the job out of the Docker container, and save the model to cloud storage.
And that took us a lot
of effort to build.
And it would be nice if
it comes out of the box,
or almost out of the box.
MARTIN GORNER: Yes, Ansible and Terraform scripts--
that doesn't sound like
data science, does it?
NIKITA SPIRIN: Yeah.
It's not data science.
It's DevOps.
We have a great set of DevOps engineers.
But yes--
MARTIN GORNER: I would like to show you a new API that just came out, which a data scientist can use to launch their notebook as a job, without doing a lot of work.
It's called fairing.
And that's what I used, at the
beginning, to launch my job.
And now, let me show
you exactly what it is.
So first of all, a fairing is the cap of a rocket. OK? It's where you put the payload.
So that's the idea.
You have a payload, and you want
to launch this payload as a job
so that you don't have to
worry about the infrastructure.
It will spin up
the infrastructure.
It will tear it down at the end.
What is my payload?
So here is my payload.
It's a notebook.
Looks very much like the
previous notebook I showed you.
And sure, let's call
this one ONSTAGE2.
What it does is that--
well, it does some exploratory
analysis of the data set.
Maybe it displays
a couple of images.
Then I have a training cell.
And then when the model,
my model, has trained,
I have an inference
cell that does something
with the trained model.
That's fairly typical
of the way I use
Notebooks to work on models.
But now, my model is
in a fairly good shape.
I want to launch its
training as a job.
And usually, you don't want to do this just once. You are going to launch it once with this parameter, a second time with that parameter, and then a third, a fourth, and a fifth time.
And very quickly, you have
multiple jobs running,
which is something you would
not be able to do on one Jupyter
machine.
So this is the fairing API.
It has three components.
The first one is
to know what are
the requirements of your job.
For that, the easiest is to
have a Docker file describing
the environment.
I was kind of scared of that.
It's not my world.
But my colleagues showed me that
the Docker file was basically
these four lines,
listing the pip
installs that I needed to do.
OK.
I can write this.
And then, the fairing API,
itself, has those three calls.
So one call--
set_builder-- to specify
what is your base environment,
the Docker file from which it
will be building your job.
A second one, set_deployer-- here I'm saying GCP.
GCP means AI Platform--
a job on AI Platform.
And I can specify here what
kind of hardware I want.
You see here, I'm
asking for a TPU v3.
TPU v3.
And finally, a third
line specifying what
is the payload--
one option is to
have a full notebook.
Of course, if it is
just Python scripts,
that works just as well.
But if I'm selecting a full notebook, I can specify my notebook file and any number of sidecar files that I need.
For example, here, my
RetinaNet Python files--
I can specify a list
of those and then
where I want to put my output.
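Put together, the three fairing calls look roughly like this; the registry, notebook name, sidecar file, and the exact keyword arguments are illustrative placeholders rather than precise signatures, so check the fairing documentation:

```python
from kubeflow import fairing

fairing.config.set_builder("docker", registry="gcr.io/my-project")   # base environment, built from a Dockerfile
fairing.config.set_deployer("gcp")                                    # run as an AI Platform job
fairing.config.set_preprocessor("full_notebook",                      # the payload: a whole notebook
                                notebook_file="retinanet_training.ipynb",
                                input_files=["retinanet_model.py"])
fairing.config.run()                                                   # package everything and launch the job
```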
When I run this, what happens
is that all these things
get packaged, automatically,
into another Docker container
and a job is launched
on AI platform jobs,
using that Docker container.
What is it?
Here, it started.
Let me check that in my jobs.
Now, I have a second job running, launched four seconds ago.
So I can stop this one.
I don't need it.
I already launched
it at the beginning.
And maybe we can check if our
model has finished training.
So let's see.
I should have here, right here--
I have this folder ONSTAGE,
and it has finished training.
This numbered folder
contains a saved model.
So let's see what
this model looks like.
I need to download this.
I'm going to upload it here
so that we can see it--
Downloads, ONSTAGE.
OK.
It's coming up.
It will take a minute to upload.
You see if I do this, it
tells me it hasn't finished.
So what I want to show you here
is that the result of my job--
usually when I launch
a job from a notebook,
I have to extract some
code from my notebook,
and then I lose my nice environment, which I'm used to.
Here, the result of the job
is a fully-rendered notebook.
So everything I have
before training,
playing with the data, all the
visualizations, will be there.
My training will be
there with all the logs,
if I want to check something.
Anything I do after training,
like playing with a trained
model, eyeballing
that it's actually
recognizing my animals--
all that will be there as well.
So let's see.
Now it's opening.
And here we go.
So if I scroll down, my
exploratory data analysis
is here.
I see my classes.
I see my training data
with the bounding boxes.
I can make sure I didn't
screw up my image resizing
and throw my bounding boxes off.
I have my training cell.
I have, of course, all
the training logs here.
At the end, I can check
that this whole training--
so full ResNet-50-backed
RetinaNet,
trained on 300,000
images on a TPU v3--
took 28 minutes and
50 seconds to train.
And, of course, you want
to know how good it is.
So let's see on a
couple of images.
Here, I'm showing your
evaluation images.
So on eval images,
in the white outline,
you have the ground truth.
And you see this fox--
pretty good.
It did the elephant.
I'm not cherry
picking my examples.
So you will see false
detections here.
There is no person.
And the chimp is not a bear.
The zebra-- OK.
The polar bear is nice.
Here, it got the
leopard, the tiger--
a little bit of an
error on the tiger.
The baby pandas are hard.
It doesn't have a lot of
babies in the data set.
So it got one.
I'm actually quite
happy about that.
In this fairly busy scene,
it sees horses and cattle.
So that's quite nice as well.
What else do I have?
I have a jaguar here.
Nobody's complaining?
Hey, this jaguar is
actually a leopard.
OK.
I didn't know the
difference either
before I started working
with this data set.
We can forgive this ONSTAGE model we built for not spotting that.
It gives me snow leopard
there, a lion, and so on.
So actually, not bad
at all for something
that we trained on stage,
here, in less than 30 minutes.
This was fairing-- a way of taking any payload, notebook-friendly, and launching it as a job. Just a little productivity tool for you data scientists-- you say where you want to deploy, you say what hardware architecture you want-- this can be a whole cluster-- and you say what you want to deploy, and you go.
SPEAKER: Good afternoon.
The expo floor will be
closing in 15 minutes.
Please make your way into the
escalators in the main lobby.
Thank you.
MARTIN GORNER: Thank you too.
So Nik, we have good hardware.
We have good software.
We have a nice
working environment
with Notebooks and Jobs.
It's time to get the work done.
NIKITA SPIRIN: Yeah.
It seems that we are short on
time, so let's get stuff done.
And just a small coincidence here-- GSD also means, for our company, the Gigster Solution Delivery engine, a process that we built to make it happen. And don't mix it up with SGD, which is Stochastic Gradient Descent.
So by having access
to TPUs and fairing,
we were able to explore a lot
of different hyperparameters,
train many instances
of our neural nets.
MARTIN GORNER: And maybe you can
tell us where you started from.
NIKITA SPIRIN: Yeah.
I will cover it as we go.
So as you can see, we show some
interesting variations here,
among all the
experiments that we did.
First thing to highlight here is that we started from a RetinaNet implementation that was not the official TensorFlow one optimized for TPU-- it was just some other implementation of RetinaNet. And it took us about three days to train this model on GPU. I would not say that this one or two orders of magnitude of speed-up, thanks to TPUs alone, is the thing we want to communicate to you here.
MARTIN GORNER: That's because
that model-- you told me
it had some issues.
You did a little bit
of investigation.
You had to train it at a
batch size of two, right?
NIKITA SPIRIN: Yeah, there was
a problem with the batch size.
There was a problem because
of the memory limits.
And we were stuck with a batch size of two.
And also, I want
to say that when
we were delivering
to the customer,
we had to deliver over
20 different models.
And we had a very
aggressive timeline.
So we shipped it to production and decided that we would get back to it during the second iteration of the product.
And after that, we
partnered with Google
to start using TPUs
for the process.
MARTIN GORNER: So on TPU--
NIKITA SPIRIN: So the thing
that we want to highlight here--
it took us about $200 and three days to train, on GPU, a model of quality comparable to what we were able to get with TPUs for $7.
So that's a great value.
And obviously,
RetinaNet optimized
for TPU is a great
contribution to that number.
The second thing that
we wanted to highlight
is the best model that we got-- it has the highest accuracy we have ever achieved, in mAP, one of the metrics used for evaluating the quality of an object detector.
We used our favorite
TPU v3, image size, 512.
And we trained it for 10
epochs, while, in the past,
we, I think, did three
or five epochs only,
just because of the
time limitations.
We didn't have the luxury of training our neural nets and waiting until all the experiments finished before finding a great combination of hyperparameters. And here, you see that with TPUs, we were able to train it in two hours to get this quality.
So that's a win.
And we are super excited to show
you this model here, on stage,
as well.
This is the model
that's currently
integrated into the product,
and it's serving real customers.
MARTIN GORNER: So let's see.
This one-- here in my notebook,
I have a little utility-- oops,
please go up--
where I can-- so let me restart
this from the beginning.
And instead of training,
I will just say, please
reload from a trained model.
It's a 512 pixel model.
And I will simply
skip the training part
and go directly to inference.
Here we go.
So now, the model is--
it will get there in--
it's reloading.
It's reloading
from a saved model.
And then, again, I will
be getting random images.
Not cherry picking anything--
random images from my eval data
set.
Now it's computing the
boxes on those images.
And, you see, the
chimp is perfect.
The polar bear is perfect.
You still have, in
white, the ground truth,
on this eval data set.
What else do we have?
That's probably not a ray.
It looks like a shark.
We have a gorilla here.
The wolf, I'm not sure.
But the Komodo
dragon is perfect.
Let's see on these more
challenging images.
Leopard and jaguar--
and this time,
in this busy cowboy
scene, it actually
sees the little dog here.
And my favorite, it got both pandas right.
NIKITA SPIRIN: Design--
MARTIN GORNER: Tiger.
NIKITA SPIRIN: --customer-centric metrics.
Make sure that two pandas are detectable.
MARTIN GORNER: [CHUCKLES]
So this was the model.
And I think what we wanted to highlight is that the velocity of your data science team is not just about comfort, or about the cost you pay. It's also that you can experiment more and get better results.
And you actually did even
more experimentation.
NIKITA SPIRIN: Yeah,
more experimentation--
it's how to use
TPUs for inference.
And by default, we
already know that TPUs
are very powerful machines,
and they expect big batches.
Whereas, when we serve models in production, the typical workload or load pattern is one image at a time hitting an API endpoint, video processing, and so on. And if we use TPUs in that scenario, that will probably not fully utilize the power of the TPU.
We thought about
some other workflow
that will benefit from TPUs.
And actually, we found this
workflow in the project.
It's not a theoretical workflow made up just for this presentation.
In our case, we had a data set
of over 100 million images,
and we had to build a
visual search solution
on top of that data set.
So what we did--
we had a pre-trained neural net.
We passed all images
through this neural net
to generate embeddings.
Then, we indexed that data set of vectors using the Faiss package. And then we got approximate nearest-neighbor search built in.
The entire stack
supports visual search.
And for that purpose,
we, again, used TPU v3.
And as you can see
on the slide, it
is the fastest and the cheapest
alternative to make sure
that we can complete this very
complex, massive indexing job,
a job that combines the
generation of embeddings
and supporting
the visual search.
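The indexing step, sketched with Faiss (the file path, embedding dimension, and number of neighbors are placeholders; in practice the embeddings come out of the TPU batch-inference job described above):

```python
import numpy as np
import faiss

embeddings = np.load("embeddings.npy").astype("float32")   # one row per image, from the TPU job
d = embeddings.shape[1]                                     # embedding dimension

index = faiss.IndexFlatL2(d)    # exact search; Faiss also offers approximate indexes (IVF, HNSW, ...)
index.add(embeddings)

query = embeddings[:1]          # in practice, embed the query image with the same network
distances, neighbors = index.search(query, 10)   # the 10 nearest neighbors for visual search
```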
MARTIN GORNER: Thank you, Nik.
So we talked here about
building neural networks,
optimizing them, but there
is actually a lot, lot more,
in a full-blown
project, than that.
This is kind of the
workflow you're using.
It's very complicated.
NIKITA SPIRIN: Yeah,
this is the workflow
that we used to develop
for the enterprise.
And it's very different
from having a Jupyter
Notebook on a laptop, or even
connected to the cloud machine.
We have the environment
where our data scientists do
training and experimentation.
Then, we move to the environment where we demo our models to the customer every two weeks, just following the agile iterations.
And we have some
representation of the customer.
I will not start pointing.
Then, we promote our models into the environment where we do the first integration of machine learning with the web and mobile parts of the product. This is where we do the testing at the program level.
And then, we already start moving to the client environment, where we have staging and production.
MARTIN GORNER: And
so what I see here
is that you need to
move all those workloads
between yourselves, and also
between you and the client.
NIKITA SPIRIN: Yeah.
It's very critical for
Gigster because we first build
it using our talent network.
And then, we have to hand
over code to the client,
and also make sure that they
can maintain that code base,
contribute to that code
base, moving forward.
MARTIN GORNER: I think you have a perfect use case for Kubeflow.
NIKITA SPIRIN: Yeah,
we are currently
looking into Kubeflow.
We are not using it
right now, but this
is one thing that could
help us and you standardize
the development
of the pipelines,
and make it shareable within your organization,
and also with the
client organization.
MARTIN GORNER: Shareable
and reproducible.
So just a shout out to Kubeflow.
If you're facing
these problems, that
is something you can
also look out for.
[MUSIC PLAYING]
