[MUSIC PLAYING]
ARUN SUBRAMANIYAN:
So today we're
going to be talking
about democratizing
AI for industrial applications
and also democratizing
AI in general.
So I'm Arun Subramaniyan.
I lead data science
and analytics
for Baker Hughes, a GE company.
I just want to start
off with some challenges
and potential opportunities
that the oil and gas
industry presents.
And although this slide has an oil and gas industry-specific title, this is pretty common for most of the traditional industries that are out there today.
Think about the left
hand side of the equation
where you have assets
systems, very large
scale critical infrastructure
that has been commissioned
over the last 30 to 50 years.
They are aging.
They're in different
stages of aging,
different stages of
performance, different stages
of reliability.
They all need to be somehow
upgraded and kept alive
at 90% or above reliability and availability.
To make things more complex
or rather interesting,
a lot of the workforce that has kept these systems up and running, and understands these systems well, is slated to retire in the next 5 to 10 years.
So even if there are
enough engineers who
will get trained and onboarded
into these industries,
it's very, very hard
for them to come up
to speed and fill the gap that
these experienced engineers
and scientists and developers
would leave behind.
So there has to be
something that bridges
the gap between the two.
And the cost of any of these systems going down is actually astronomical, even today. It's scary to think about what the cost would be 10, 12, or 15 years down the line.
The number down there,
that is $6.6 billion
in unplanned downtime cost.
That is just for downstream
in North America.
So if you look at the unplanned downtime costs across the world, from upstream, where you extract oil, to midstream, where you transport oil, to downstream, where you process it and turn it into gasoline or any of the other petroleum products, that number is in the hundreds of billions, if not trillions.
Now that's the challenge
part of the story.
The opportunity side
is on the right side.
Now, the oil and gas industry enjoyed very high oil prices for a while. And it also enjoyed growth in demand across the board, until about 2012, 2013, when the oil price went from $130 a barrel to less than $40 a barrel overnight.
And what that caused was a sudden realization that the industry was lagging slightly behind the aviation industry, the power industry, and other traditional industries. And the need for digitalization, the need for going after something they had not done before, came about very, very suddenly in the industry.
And the oil prices
have stabilized,
have gone back
down and up again.
But what it has resulted in is the top right figure that you see there: of the money that industrial companies are going to invest in AI and modern analytics technologies, 75% is going to come from the oil and gas market.
That's partly driven by the need
and partly driven by the fact
that it was lagging
behind for some time.
Now they're catching
up quite a bit.
Now flip the equation
a little bit.
And I didn't grow up in
the oil and gas industry.
I actually grew up in the
aviation and power industry,
before coming on to looking at
problems that require analytics
in oil and gas.
And the first thing
that surprised me
was the scale and quantity
of data in oil and gas.
So compare the volumes
of data here on the slide
here to what you would
see in aviation or power.
So if you take off from
New York, land in London,
typically, the aircraft, with its engines active, generates about a terabyte of data.
Out of that terabyte of data,
you store maybe 100 megabytes.
Because it's very
expensive to store,
very expensive to transmit,
and traditionally, it's
not something that the
airline industry has stored.
Compare that to the oil and gas industry, where an offshore oil rig, an offshore oil platform, has anywhere from 20,000 to 30,000 sensors generating data every two to three seconds, stored in perpetuity since the day that platform was commissioned, so 20, 30 years.
That's a data scientist's bonanza.
And if you can actually
get to that data
correctly and do something
about it, that's fantastic.
And that's really where
the opportunity lies.
Now I also want to pinpoint one thing: analytics means a lot of different things to a lot of different people. What it means to us at Baker Hughes, a GE company, and GE [INAUDIBLE] is the combination of three very important things.
Analytics is not
necessarily anything new.
If you look at engineering, at any kind of system, the combination of large-scale domain knowledge acquired over time and traditional software has been around for decades.
That's what produces
traditional solutions.
That's what runs
the world today.
The new kid on the
block is what today
is known as the cool
parts of data science.
Think of your most favorite
algorithm, most common buzzword
out there.
Throw it in there.
That's data science-- combined
with a slightly different way
of writing software, modern
software on the cloud,
automatically
scalable, horizontally
and vertically scalable.
Now that has also
been proven to do
some things that
are not possible
from the traditional solutions.
Now if any of us were told 10
years ago that a system would
automatically tell you when
to leave the stage today, go
home, start at 4:30 so that
you can actually get home
on time navigating
all the traffic,
making sure that you're
actually on time to take
your daughter to
her dance class,
because your calendar knew
that your daughter has a dance
class, your phone
knows that you're here,
and it knows to navigate
between those two locations,
we would all probably have laughed.
But that's solved today.
Or think of a case where a
shop would send you something
before you even expressed
the intent that you
liked that thing.
That's also solved.
Like, Amazon sends you stuff today
that you never once
liked anywhere.
A little creepy, but it's done.
And that's the needle in a
haystack kind of problem, very,
very hard problems to solve.
But thanks to the Googles
and Amazons and the Facebooks
and Apples of the
world, they have shown
us a way to solve those things.
Now translate that to
the industrial domain.
Most of the time, those 20,000 sensors transmitting data every two to three seconds, every day for the last 30 years, just tell us that the systems are fine, because the systems are designed to be fine for a very long period of time. So if you go looking for something meaningful in that data, most of the time you find nothing. All that data means nothing most of the time.
The one time when
the system fails,
the one time that something
actually goes wrong,
you won't have any
data about that thing.
And that's really the problem we're trying to solve. We're trying to find a needle in an ocean of data, not even a needle in a haystack.
And most of the
time, we don't even
know what the needle looks like.
And that's the problem
we're trying to solve.
And I'll give you
some examples of that.
But before I go into that one,
I know I introduced myself
as a head of data science,
but this is a tenet that we and our teams believe in very deeply.
And this is something
attributed to George Box.
He said it a long time ago.
"All models are wrong,
but some are useful."
Box stopped there.
Actually we go a little
further than that.
"All models are wrong, some
are useful for a short period
of time."
And if you believe
that very deeply,
then you design your systems
very, very differently.
You design your
teams differently.
You think about problems
and solutions differently.
So we don't necessarily
go into a problem saying,
hey, we have this cool
data science team,
and the domain experts or subject matter experts who have been designing the system and running it reliably for the last 30 years don't know what we know, so we'll come solve the problem for you.
We actually rarely do that.
Sometimes we do that
and get paid for it.
But most of the time, it's this statement that drives us.
And I'll give you
the reasons why.
And if you think about this
carefully, any model that we
build, even the most
sophisticated deep learning
model that you can
build on thousands
of hours of GPU or TPU compute,
this statement still holds true.
So you have to have systems that
constantly update themselves,
constantly go in and
make sure that they
are monitoring your models.
So you need models that
monitor your models.
And then you need
more models that
monitor your people
who build the models.
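To make that idea concrete, here is a minimal sketch of a model that monitors a model, in Python; the baseline error, window size, and degradation factor are illustrative assumptions, not our production monitoring service.

```python
import numpy as np
from collections import deque

class ModelMonitor:
    """Watches a deployed model's recent error and flags when it degrades.

    A toy version of 'models that monitor your models': if the rolling error
    drifts well above what was seen at validation time, flag for retraining.
    """
    def __init__(self, baseline_rmse, window=500, degradation_factor=2.0):
        self.baseline_rmse = baseline_rmse          # error level at validation time
        self.residuals = deque(maxlen=window)       # most recent prediction errors
        self.threshold = degradation_factor * baseline_rmse

    def observe(self, predicted, actual):
        self.residuals.append(predicted - actual)
        return self.needs_retraining()

    def needs_retraining(self):
        if len(self.residuals) < self.residuals.maxlen:
            return False                            # not enough evidence yet
        rolling_rmse = float(np.sqrt(np.mean(np.square(self.residuals))))
        return rolling_rmse > self.threshold

# Hypothetical usage:
# monitor = ModelMonitor(baseline_rmse=1.2)
# if monitor.observe(prediction, measurement):
#     kick_off_retraining_pipeline()               # placeholder for the real trigger
```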
So that is where our concept
of digital twin was born.
And most of the
concept on this slide
was born out of pure necessity.
And so we didn't really
sit in an ivory tower,
think about this, and say
this is a fantastic way
to solve problems.
We were actually posed with problems like, OK, there is one failure in the field. There's one jet engine that failed. Can you actually tell from that one failure whether you have to take down an entire fleet of engines, which would cause a massive disruption in the entire transportation network of the world, or not?
Or safety issues, where an offshore oil platform has just sent off an alert. Do you shut down the platform, which costs you about $50 million a day if you shut it down? If you don't shut it down, it's going to cause an environmental disaster that's going to cost about a trillion dollars over 3 to 5 or 10 years.
Those are the kinds of questions you're trying to answer.
And most of the
time we don't have
data to go about
solving the problem.
So in the concept of a digital twin, the model is a combination of domain knowledge (we most often just start with domain) and traditional machine learning, which gives us the ability to solve known problems very well.
So if I tell you
there's a crack,
the crack is propagating
because of a thermal load.
I can give you very,
very, very precise models
that tell you how
the crack is going
to propagate under that load.
That's the known knowns.
Now if I go one
level above and say
I don't necessarily
know the temperatures
that are affecting
this particular part.
I can measure it to within
plus or minus 10 degrees
or 20 degrees.
That's a known unknown.
So the known unknown
of temperature,
I can propagate through and
tell you, well, this part
is going to fail within the next
10 days with a 90% probability.
That's where probabilistic
inference comes in.
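As a rough illustration of that kind of probabilistic inference, here is a small Monte Carlo sketch in Python; the crack-growth model, its constants, the temperature distribution, and the 10-day question are all made up for illustration, not our actual domain models.

```python
import numpy as np

rng = np.random.default_rng(0)

def days_to_critical_crack(temperature_c, initial_mm=2.0, critical_mm=10.0):
    # Hypothetical crack-growth model: growth rate rises with temperature.
    # The functional form and constants stand in for the precise domain model.
    growth_per_day_mm = 0.5 * np.exp(0.02 * (temperature_c - 650.0))
    return (critical_mm - initial_mm) / growth_per_day_mm

# Known unknown: the temperature is only known to within about +/- 10 degrees.
temperature_samples = rng.normal(loc=650.0, scale=10.0, size=100_000)

# Propagate that uncertainty through the domain model.
days_samples = days_to_critical_crack(temperature_samples)

# Probabilistic answer: chance the part reaches critical crack size in 10 days.
p_fail_10_days = float(np.mean(days_samples <= 10.0))
print(f"P(failure within 10 days) = {p_fail_10_days:.2f}")
```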
The real advantage of
using deep learning for us
is where I can't
necessarily pose the problem
in known knowns
or known unknowns,
cases where I know that
something is going to go wrong,
but I don't necessarily
know what to go catch.
So I'll go back to the
20,000 sensor example.
We have 20,000 sensors emitting
data every two seconds.
I've written rules and heuristics, and I'm already watching all kinds of critical things.
I can catch maybe 90%
of the things that
are going to go wrong.
What about the last 10%?
I don't know what
rules to write.
I don't even know what to watch, or how often I need to watch it.
That's really where
deep learning systems
come into play.
That's how we actually
got into deep learning.
And then as we learned
more and started
deploying these systems
more, we realized
that it can also tell us
when to go build our models.
It can also tell us when
to go watch what models.
So I'll give you some
examples of that.
But before I go there,
I wanted to give you
an example of what we
think about as scale
or systems of systems.
So we most of the
time look at systems
that are very, very large.
So that offshore
oil platform has
roughly about 30 different
large scale systems in that.
Each system has anywhere from
10 to 15 large subsystems.
Each subsystem has
50 to 100 assets.
An asset would be something
like a gas turbine.
For those of you who have
not seen a gas turbine,
some of the gas turbines fill rooms that are twice the size of this one.
So these are very, very large scale things that are considered a small asset. And they won't even be individually visible in this particular plot on the right side.
And most of the time, the
problem we are trying to solve
is OK, I have data from all
of these different assets, all
of these different sensors
coming in, what is going wrong?
I already know how to catch 90% of the things that are going wrong. Catch the 10% of things that I cannot catch. "I" here meaning the operator of the actual platform.
So what we do is we actually
convert the physical system
into an information network.
So we know already
how assets are
connected, how
subsystems are connected,
how systems are connected.
The connectivity is what is shown on the right side.
So each node there
is actually an asset
that is connecting
to each other,
maybe a pump is connected
to a fuel system,
a fuel system is connected
to a gas turbine,
a gas turbine would be connected
to a compressor, a whole lot
of different things.
That physical connectivity
is shown on the right side.
But the colors actually
come from the data.
So what we do is we take operational data and try to understand, purely from the operational data, whether anything is connected to anything else, without knowing how they're connected in the physical space.
That's the one on the left side.
Now you superimpose that
onto the right hand side,
you see some interesting things.
You start seeing things
where most of the systems
seem to be connected only with
the assets inside that system.
So if you look at the blue dots,
they are only in the top left
and maybe things that
are physically connected.
But there are some green
dots on the top right
and on the bottom left.
There is no physical connection.
There's no direct
physical connection.
There's not even a second or
third order direct connection.
But your operational
data is telling you
that there's
something correlated.
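A minimal sketch of that idea in Python, using pandas and networkx; the file name, the correlation threshold, and the physical connections are illustrative assumptions.

```python
import pandas as pd
import networkx as nx

# Operational history: one column per asset tag (hypothetical file).
ops = pd.read_csv("operational_history.csv")

# Data-driven graph: connect assets whose signals are strongly correlated.
corr = ops.corr().abs()
data_graph = nx.Graph()
for a in corr.columns:
    for b in corr.columns:
        if a < b and corr.loc[a, b] > 0.8:        # illustrative threshold
            data_graph.add_edge(a, b)

# Physical graph: known connectivity, e.g. pump -> fuel system -> gas turbine.
physical_edges = [("pump_01", "fuel_sys_01"),
                  ("fuel_sys_01", "gas_turbine_01"),
                  ("gas_turbine_01", "compressor_01")]
physical_graph = nx.Graph(physical_edges)

# Correlations with no physical explanation are the interesting ones.
unexplained = [e for e in data_graph.edges() if not physical_graph.has_edge(*e)]
print("Correlated but not physically connected:", unexplained)
```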
Now this could be
one of two things.
It's the traditional, or classical, problem of correlation is not causation. This could very well be, hey, are you trying to tell me that the S&P 500 index is correlated with milk prices in Bangladesh? You can definitely come up with a data set that tells you that.
Now the proof of this exercise is: can you reliably predict failures that you could not predict before? And that is shown here. So we've taken this approach where we combine the physical network with the operational history, connect them together, build models, very large scale models (and I'll tell you what these models are), and predict what was not known to us before.
So this is a real example
where a customer gave us
operational data for two years' worth of operations
of one subsystem.
Sorry, one system.
And that one system had
about 13 subsystems,
roughly 4,000 sensors.
That's what they call
a proof of concept,
about five terabytes of data.
We built our models. All of these models were a combination of autoencoders, probabilistically enabled, and a system-of-systems network model. All of them were combined together to predict on 30 years' worth of data, trying to figure out if the model can predict things that it hasn't seen before.
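As a rough sketch of just the autoencoder piece of that, here is a minimal TensorFlow example; the sensor count, layer sizes, training-data file, and threshold are illustrative, and the probabilistic layer and the system-of-systems network model are not shown.

```python
import numpy as np
import tensorflow as tf

N_SENSORS = 64   # illustrative; a real subsystem has thousands of tags
normal_data = np.load("normal_operation.npy")   # hypothetical file, shape (n, N_SENSORS)

# A small dense autoencoder trained only on normal operation.
autoencoder = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(N_SENSORS,)),
    tf.keras.layers.Dense(8, activation="relu"),    # bottleneck
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(N_SENSORS),
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(normal_data, normal_data, epochs=20, batch_size=256, verbose=0)

# Anomaly score = reconstruction error; the threshold comes from normal data.
def reconstruction_error(x):
    return np.mean(np.square(x - autoencoder.predict(x, verbose=0)), axis=1)

threshold = np.percentile(reconstruction_error(normal_data), 99.5)

def is_anomalous(batch):
    # Flags behavior the autoencoder cannot reconstruct, i.e. behavior it has
    # never seen, without anyone writing a rule for what "wrong" looks like.
    return reconstruction_error(batch) > threshold
```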
And the proof is we could successfully detect all unplanned downtime events. These were events that we hadn't seen; we didn't even know how many there were. We just predicted 16 out of 16 events.
Now I could have done that very easily by just calling everything an event. I could have called 1,600 failures, and that would have included the 16 failures that were actually there.
So to make sure that the model was not overtuned, or just junk, or just flipping a coin, we also made sure that we could compare it with the current scenario of how you catch faults.
So we were able to catch all unplanned downtime events while reducing false positives and false negatives simultaneously.
And this one we
can do repeatably.
And to do this
repeatably, we actually
have to go through many
different pipelines.
So if you think about
this, you've got data.
You validate the data.
You walk through say,
an architecture space.
You give it to a data scientist.
They say, OK.
I know for this
particular problem
I can get to a model
that looks good.
They will use a particular
type of autoencoder.
Maybe I use a
particular type of LSTM.
I'll get a model.
And a data scientist's time is very valuable. Usually you do maybe one or two, maybe three or four iterations, and you stop. That's the typical workflow, from left to right.
In our case, we didn't have the luxury of saying, OK, I have a prediction, you give me the results of a blind validation test, and I can go back and update my models.
I still have to go
and build models
without knowing what my
validation data set is,
what the true test data set is.
So we actually had to
go build and explore
the entire architecture space
and come up with models that we
know are not just a model,
but a very, very good model
to begin with.
So the way we do it is we
actually start with initial
samples that you see here--
those are the red crosses--
where we say, let me go explore the design space, and use that to build a Bayesian meta-model. So this is a meta-model that tells you, OK, what happens if I change the number of layers, if I change the activation functions, if I use regularization or not.
That is, if you
decide that you're
going to use some kind
of a deep neural network.
Or you can also go in and say, let me go build a Gaussian process model.
Or I'll just go build a simple
linear regression model.
All of those are
different search spaces
that we're searching across.
Use it to build a better model.
And then use a genetic
algorithm to go
optimize all of those different
variables simultaneously.
So it's a massive optimization
exercise that goes from the red
dot through all the grey dots
to get you to those green Pareto-optimal points.
Now typically you may be used to seeing optimization problems where going up and to the right is good. Here, the two objectives are loss functions. So you want them to be as small as possible, reducing loss in different dimensions to get them to very, very small values.
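Here is a compressed sketch of that search loop in Python with scikit-learn; the hyperparameter space, the stand-in train_and_evaluate function, and the simple sum-of-losses selection (instead of keeping a full Pareto front) are illustrative assumptions, not the production implementation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(1)

# Search space: (num_layers, units_per_layer, dropout_rate, use_regularization)
BOUNDS = np.array([[1, 8], [16, 512], [0.0, 0.5], [0, 1]], dtype=float)

def random_architectures(n):
    u = rng.uniform(size=(n, len(BOUNDS)))
    return BOUNDS[:, 0] + u * (BOUNDS[:, 1] - BOUNDS[:, 0])

def train_and_evaluate(arch):
    """Hypothetical stand-in: train the candidate network and return its two
    loss objectives. A real run trains on GPUs and returns validation losses."""
    layers, units, dropout, reg = arch
    loss_a = abs(layers - 4) / 4 + abs(units - 128) / 512 + dropout
    loss_b = abs(units - 256) / 512 + (1 - reg) * 0.1
    return loss_a + rng.normal(0, 0.01), loss_b + rng.normal(0, 0.01)

# 1. Initial random samples (the "red crosses").
X = random_architectures(20)
Y = np.array([train_and_evaluate(a) for a in X])

for generation in range(5):
    # 2. Bayesian meta-model: one surrogate per loss objective.
    surrogates = [GaussianProcessRegressor(normalize_y=True).fit(X, Y[:, k])
                  for k in range(2)]

    # 3. Genetic step: mutate the current best candidates and screen the
    #    offspring cheaply on the surrogates instead of real training runs.
    parents = X[np.argsort(Y.sum(axis=1))[:10]]
    offspring = parents[rng.integers(0, len(parents), 200)]
    offspring += rng.normal(0, 0.05, offspring.shape) * (BOUNDS[:, 1] - BOUNDS[:, 0])
    offspring = np.clip(offspring, BOUNDS[:, 0], BOUNDS[:, 1])
    predicted = np.column_stack([s.predict(offspring) for s in surrogates])

    # 4. Only the most promising offspring get real (expensive) evaluations.
    chosen = offspring[np.argsort(predicted.sum(axis=1))[:10]]
    X = np.vstack([X, chosen])
    Y = np.vstack([Y, [train_and_evaluate(a) for a in chosen]])

best = X[np.argmin(Y.sum(axis=1))]
print("Best architecture found:", best)
```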
To get here, we had to use TensorFlow at its edge. We had to go and use Kubeflow at its edge.
And I'll tell you
how we did that.
And combine all of this with
TensorFlow probability as well.
Of course, all of
these running on Nvidia
NGC containers as well.
The nice thing is,
to get to a model,
we could go explore
more than a quarter
million architectures very
rapidly for one model.
So we're just going and
building one model here.
And the improvement over just a random search is at least 10 times.
And 10 times is what we
can achieve consistently.
So you give it to the best data scientist we have on the team. They go and figure out, OK, this model will work for this particular use case, I've built it many times before, let me go build it. Come back to that, and you still come up with architectures that are significantly better.
Now if you flip the
question and ask us,
do you understand
the architectures?
Do you understand why
something is working?
The answer is no.
Many times we
actually look at that
and say why exactly
is it working?
I don't know.
And I don't know if I would ever have picked that particular architecture myself, but it works in this particular use case.
And this is for one model.
So for the offshore oil platform
that I've been talking about,
that platform has
about 40 systems.
Each of the systems would have
about eight of these models.
And so the scale you're talking
about is significantly high.
And to get this to run properly, of course, costs a lot of GPU compute.
And as most of you in the room probably know, GPUs are reasonably expensive.
And running them at this
scale can break the bank.
Now the general question we
face is are you guys crazy?
Do you really, really want to run this many models at scale?
This is a plot we actually
use with our CFOs as well.
So that's the anomaly
detection model
on the left side that's
actually at scale
from the previous slide.
The model that Facebook open sourced just last month, the ELF OpenGo model, ran on 2,000 GPUs for two weeks.
So compared to that, the scale of operations that we're talking about, where a model saying something wrong can give you an outcome that causes the business to lose tens of millions, if not hundreds of millions, is actually very, very limited.
The reason there's sticker shock in these kinds of cases is that, up to this point in time, we have not been used to building these kinds of models. For these kinds of operational histories, we haven't really built models like this.
Today, if you ask us
how an offshore oil
platform is managed,
it's probably
managed with a spreadsheet.
But what is lost in that spreadsheet calculation is all of the effort that goes into getting a simple answer to the operator. Imagine what the operator can do with a model that tells you 50 days or 100 days in advance that something is going to happen, so that you can change your operations.
The financial impact
of that is significant.
To do all of these things at scale, we need to orchestrate things very, very differently. So for us, everything runs inside containers. Containers are a first-class citizen.
And it's not just that we get
to build these cool new models.
We also have to live
in the world of models
that have existed for
the last 30 years, which
are running reliably.
Most of the time we can't necessarily go rewrite them or change them.
And most of the time we
don't need to change them.
So the workflow that's shown here, where you start with a Python model, it goes into some proprietary model, say, for example, [INAUDIBLE], there's a Fortran model that feeds something into that pipeline, that goes into something else that's Python, that might actually run a robust optimizer in Go, is actually fairly common for us. We don't necessarily use Fortran IV that much, but Fortran 77 is something we have quite a lot of.
And that's not just oil and gas; it's true for other industries as well. So you need to be able to run all of these things simultaneously, at scale, orchestrated, and get an answer back to the user.
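As a toy illustration of that kind of orchestration, here is a sketch using the Docker SDK for Python; the image names, commands, and paths are hypothetical placeholders.

```python
import docker

client = docker.from_env()
shared = {"/data/run_001": {"bind": "/work", "mode": "rw"}}   # hypothetical host path

# Each step is just a container; the language inside doesn't matter to the
# orchestrator. These image names are placeholders, not real images.
steps = [
    ("python-preprocess:latest", ["python", "preprocess.py", "/work/raw.csv"]),
    ("fortran77-thermal-model:latest", ["./thermal_model", "/work/preprocessed.dat"]),
    ("go-robust-optimizer:latest", ["/optimizer", "--input", "/work/thermal.out"]),
    ("python-postprocess:latest", ["python", "report.py", "/work/optimized.json"]),
]

for image, command in steps:
    print(f"running {image}")
    client.containers.run(image, command, volumes=shared, remove=True)
```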
Now for the example I showed before, where you start with a few red crosses, the initial points on the top, there's an actual Kubeflow pipeline implemented to run both on prem as well as on the cloud. We'll show you running this on GCP today. That's the initial models running. After all of those models are run, Kubeflow decides, OK, let me go build this Bayesian model, and then spins off yet another set of models after the optimization. That's the second iteration. You keep going through multiple iterations. Finally, you get to an optimal architecture.
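A minimal sketch of what such a pipeline could look like with the Kubeflow Pipelines SDK, assuming the KFP v1-style ContainerOp API; the image names, bucket, and iteration counts are placeholders, not our actual pipeline.

```python
import kfp
from kfp import dsl

@dsl.pipeline(name="architecture-search",
              description="Initial samples -> Bayesian meta-model -> next batch")
def architecture_search(batch_size: int = 50):
    # Step 1: the initial random architectures (the "red crosses").
    initial = dsl.ContainerOp(
        name="initial-samples",
        image="gcr.io/my-project/train-batch:latest",   # hypothetical image
        arguments=["--mode", "random", "--count", batch_size],
    )

    previous = initial
    for i in range(3):
        # Step 2: fit the Bayesian meta-model on everything trained so far.
        meta = dsl.ContainerOp(
            name=f"bayesian-meta-model-{i}",
            image="gcr.io/my-project/meta-model:latest",
            arguments=["--results-from", "gs://my-bucket/results"],
        ).after(previous)

        # Step 3: spin off the next batch of candidate architectures on GPUs.
        batch = dsl.ContainerOp(
            name=f"train-candidates-{i}",
            image="gcr.io/my-project/train-batch:latest",
            arguments=["--mode", "from-meta-model", "--count", batch_size],
        ).after(meta)
        batch.set_gpu_limit(1)
        previous = batch

if __name__ == "__main__":
    kfp.compiler.Compiler().compile(architecture_search, "architecture_search.yaml")
```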
Now the data scientist who
is building these models
does not know much
about Bayesian models
or even optimization.
The optimization expert
who knows and understands
stochastic optimization does not
know what models are actually
being optimized.
The software architect, who makes sure that all of these things are done at scale, that GPUs are actually available when you require them, and that all of these things are available when a data scientist kicks off a run, does not necessarily understand any of this.
You need to be able to
orchestrate all of these things
together to run at scale.
Now to even get there,
we had to make sure
that OK, you can go ask
for 100 GPUs or 1,000 GPUs
or 100,000 GPUs from GCP.
They may give them to you. And it's a lot of money on your side to actually have that hardware available to you. But do you actually have the software to go load it up at scale?
So this is a load
balancer that we wrote.
And this is a
result of a test run
where we have two systems
competing with each other.
Every time the load balancer sees that all of these GPUs, all the resources, are consumed, it will add a GPU node. So that's what's happening on the top left. So each time it hits the limit, you add a node.
The actual load or utilization
per GPU is on the bottom left.
Immediately after a
GPU node is added,
the total utilization goes down.
So there's something else
that's watching that says,
OK, the total utilization
has gone down.
I actually have more resources.
I can reload them back up.
So there is a constant
tug of war going on
between these two systems
where the load drops and it
goes back up.
As soon as it goes back
up, another GPU node
is added, until of
course you hit a point
where you can't spend anymore.
So that's what happens until you hit 10 nodes with two GPUs each.
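A simplified sketch of that tug of war in Python; it reads utilization from the local GPUs via NVML and resizes a GKE node pool with gcloud, whereas the real load balancer aggregates metrics across whole clusters. The thresholds, cluster, pool, and zone names are assumptions.

```python
import subprocess
import time
import pynvml   # NVIDIA Management Library bindings, used only to read utilization

def average_gpu_utilization():
    pynvml.nvmlInit()
    try:
        count = pynvml.nvmlDeviceGetCount()
        utils = [pynvml.nvmlDeviceGetUtilizationRates(
                     pynvml.nvmlDeviceGetHandleByIndex(i)).gpu
                 for i in range(count)]
        return sum(utils) / max(len(utils), 1)
    finally:
        pynvml.nvmlShutdown()

def resize_node_pool(num_nodes, cluster="training-cluster",
                     pool="gpu-pool", zone="us-central1-a"):
    # Resizes the GKE node pool; cluster, pool, and zone names are placeholders.
    subprocess.run(["gcloud", "container", "clusters", "resize", cluster,
                    "--node-pool", pool, "--num-nodes", str(num_nodes),
                    "--zone", zone, "--quiet"], check=True)

nodes, MAX_NODES = 1, 10
while True:
    util = average_gpu_utilization()
    if util > 90 and nodes < MAX_NODES:
        nodes += 1
        resize_node_pool(nodes)      # scale up when everything is saturated
    elif util < 30 and nodes > 1:
        nodes -= 1
        resize_node_pool(nodes)      # scale back down so the billing stops
    time.sleep(60)
```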
So getting up to this level of scaling: when you add a GPU node, you get it instantly and your billing starts instantly, but most of the time you're not going to use it instantly. So you have to be careful about that.
And after you're
done, you need to be
able to shut it down instantly,
because billing continues
to go.
So this is the same Kubeflow experiment that we showed before, running here. We ran the first experiment, and the load spikes on the top left. Then the Bayesian optimization runs, so the load drops. Then the second batch runs, and the load goes back up.
An interesting thing that's
happening on the bottom right,
you can see, is each of
the times it hits max load,
it actually adds more nodes.
And once all of the
initial runs are done,
it spins it back down way
below where it started.
And this has to happen
consistently across the board,
because many times we have 10,
20, 30 data scientists running
these pipelines.
And we can't necessarily have SREs on standby to keep increasing GPUs, or even watching whether you have the ability to run GPUs or not.
I'm picking on GPUs a bit here, because they're reasonably expensive. But if you use TPUs, it's the same thing, and it's not even just compute. If you're using high performance storage, for example, that can also break the bank. So you have to be very careful about loading at these kinds of scales.
And that is another example of
us showing that we can actually
run through all of
these architecture steps
and keep the GPU utilization at or above 80%.
That alone is not a
simple problem to crack.
So when you're
thinking about scale,
also think about the fact
that if you acquire hardware
or acquire virtual
hardware, you also
need software that can
scale with the virtual
or actually real hardware.
Now how did we get here?
The title of the talk
is democratizing AI.
And I promise I'll
get back to that.
Doing all of these
things is not necessarily
unique for oil and gas.
Almost all industries
out there are solving
one hard problem or the other.
And to get to anything at
scale, you need more than one
or two types of disciplines.
And we got here by blood
and sweat and very hard work
by a lot of people in very
disparate and different
communities.
So we had data scientists.
But lumping all data scientists
into one large bucket
is also not being
that fair to them.
There are traditional
machine learning researchers.
There are probabilistic
researchers.
And also, a lot of deep
learning researchers
working together to get
us to where we are today.
Domain experts are the folks who
told us what to go solve for.
So we didn't really go
to them and say, give us
your easy problems.
We'll just make them faster.
We actually told them,
give us the problems
that you can't solve or
have not been able to solve.
Give us your hard problems
and help us understand them
to the point where we
can at least tell you
whether we can
solve them or not.
And we walked into problems where we had maybe a 50% certainty that we could solve them. And in many cases that is necessary, because those are the problems that can justify the cost of using a system like this.
And software again, is
not all created equal.
So we put these
specific words here
because thinking about
software in silos
also causes a lot of problems.
Most of our software
engineers were full stack.
They also were very experienced
in distributed systems.
And I'll talk to you about why that is.
And thinking about cloud
native is something
we had to learn along the way.
We started with systems that
were completely on prem,
and then moved to
the cloud thinking,
OK, we can more
or less modify it
a little bit to
get to the cloud.
Only when we started hitting
scales of the problems
that we had to
solve, we realized
that we had to go rewrite
a lot of different things.
So if you want things to run at that scale, be willing to rewrite 30%, 40%, sometimes 50% of your systems.
So what does democratizing AI mean?
Because I thought about
this a little bit for this
talk, and I've been thinking
about this for a while.
It means different things
to different people.
Now if you're a data science
researcher, a data scientist,
or a practitioner who's been
doing this for many years, when
you hear the word
democratizing AI,
you probably think OK, I
know how to build my models.
I have my tools.
I'm going to
continue doing that.
I'm very, very good at using my own tooling.
But I don't necessarily
know how to make
my models available for hundreds
if not thousands of people
repeatably.
So productionizing
a model is probably
what people think of when
you're a data science
researcher or an expert.
And by the way, I looked it up; productionize apparently is a word.
If you're a domain expert or a business user, or somebody who hears a lot about AI, who's heard all the buzz, maybe tried something out yourself, and you find that most of the tooling is hard to use, most of the jargon is very hard to understand, and many times you can't connect an outcome that looks nice to the outcome that you care about, then you want to be able to build these models fast and productionize them very, very fast.
Now if you're a
software developer,
if you're an
applications developer,
you may not necessarily care
about the specific nuance
of an algorithm or a specific
nuance of how the model has
been reduced to a size that can
be scaled and say, served well.
Somebody already did that work.
You care more about how to consume it for the particular use case you have in mind.
Now any of the toolsets out there, any of the tooling out there, people who are building tools for democratizing AI, have to cater to all three of these very different, say, workloads.
And we've been doing
this for a while.
And mostly because of
necessity for ourselves,
because we have a
small team, which
has to scale to very large
problems, as well as very, very
large scale systems.
So we made some very
careful choices.
So being container first was out of necessity.
And I will give you some metrics
on what going to containers
did to us.
So we did come up
with a philosophy
called depend on Docker.
That's a logo for it.
Obviously, we like doing things,
hence D-O, depend on Docker.
And everything we
run is containerized.
Except for raw storage,
everything is containerized.
And that also means we have to orchestrate at scale as well.
So we'll show you a demo of
a model and a set of models
running at scale today.
But before I get to the demos,
just what did it do to us?
So like, rewind
five years ago, when
we were trying to build
what we called at that point
in time, a digital
twin platform for GE,
we would build a
technique, we would
build a model that would
be in the research phase.
Getting that model into production meant that we had to stick to very, very strict, say, quarterly releases.
And every single
quarterly release
would go up to the midnight
before the release,
and something that was
working just fine the day
before would break.
And most of the time
it was some dependency
around the model side.
And the number of times we heard, hey, this is working just fine on my laptop, or it's working just fine on my system, it must be something wrong with the production system, just drove us nuts, to be honest.
We moved to containers. And five years ago, moving to containers was not that popular. We still had to face comments like, well, nobody would ever let you run containers in our system. Today, everybody runs everything in containers.
It moved the production deployment of a model that has already been validated from taking six months to roughly an hour.
So the model has already been validated by somebody, a data scientist, a subject matter expert; they agree on the model. That model going from their systems to a production system takes roughly an hour for us today. Of course, it doesn't always happen in one hour, every time. But if the data scientist was, say, disciplined enough to use our tooling from day one and actually have everything running in containers, those containers transitioning into our production systems takes under an hour.
So that's really what
going to containers
and sticking with
discipline got us.
Now for the demo use case, I'll show you two different flavors. The first one is a simple case where I'm just going to take a data set, run it through some algorithms that will detect what is in this data set, and tell you what possible models we can build. On a UI, it will actually build several different kinds of models to get you to a best model. But the same demo will also run and show you that this can scale across thousands of models built live on stage. That was a promise in the abstract as well. And hopefully, if the demo gods agree, we'll show the demo working today.
To get that to
work, we obviously
need high speed storage.
But a lot of the core
services had to scale as well.
And we had to break it
down to a point where
we can control things at a very fine-grained level.
And if you see at the top
we have both GKE clusters
and self-managed clusters.
And in this particular
case, we did that because we
were experimenting with things.
But it actually came
back to save us,
because a recent update from a wonderful team introduced a non-backward-compatible change. And that broke our self-managed clusters. But the GKE clusters are up. So while preparing for this talk, we were actually fine going through the preparations, as we were upgrading our self-managed clusters as well.
So I'll walk you
through how to do that.
But this is the cluster
we'll be showing
you hopefully live running.
We have a master node.
And what is expanded
on the right hand side
is just a master node.
The lines are not
necessarily correct.
So we have 100
nodes per cluster.
That's the GKE master cluster.
On the top are five
workload clusters.
On the bottom are five
self-managed clusters.
So I'm going to kick off
some model run times.
And that'll actually load all of
these different cases as well.
So before I go run and spin
up all of those clusters,
I just wanted to show you a
demo of what a subject matter
expert would do if they didn't know anything about what is in the data set. So this is an AI system called AI Factory.
But I just come in and
drag and drop a data set.
The data set is simple
tabular data, as you can see.
It just has one header with,
in this particular case,
I think five variables and
about 100 rows of data.
And this is a data set that we created to show how you predict the wear and tear of a wheel depending on how fast you're accelerating the car, what surface you're driving on, and things like that. So what happened here is, OK, it knows that it's a CSV file. It wants to see if we can build [INAUDIBLE].
Now going from that first page
to here, what happened was it
understood what is in the file.
And this is, granted, a small file. But if we had a large file with millions of rows, it would do the same thing. It understood which variables in the file are numeric, and also went in and said,
OK, given that these are
the variables you have,
these are the potential
variables that you can
use as outputs in your system.
And that means that for
each of those outputs
it will go in and use the rest
of the variables as inputs
and try and build
your best model.
And it not only did that,
it went back and said,
OK, here are all the
different techniques
in the system they're
already preloaded that can
be used with this data set.
Now if you're a
subject matter expert,
you don't necessarily know what
all of these techniques mean.
You don't necessarily
need to care.
You just got here.
The system is
suggesting something.
Let it continue to go on up to the point where it has built some models, and then you can evaluate whether the model is useful or not. That's where we are going with it.
Now if you don't necessarily
agree with all the variables
that it shows, for
example, if I don't want
to build a model
for friction, but I
want to build the
model for acceleration,
I could do that as well.
And then I go in and say,
OK build all 24 models.
Now a couple of things need
to happen for this to work.
It needs to take this data set, split it into testing and training, connect it to all of these different techniques, select all the defaults, and then come out with the best models.
Now in the time it took me to say that, it went and built all those models live, came back, and is telling us the best models for each kind of output that the system identified.
Now some of these outputs may
not necessarily make sense.
For example, I don't necessarily want to build a model for weight. But I want a model for wear. I want a model for acceleration, for example.
And it also went in and said, out of the six techniques it evaluated, for each of the outputs, here is the best model based on a particular root mean square error.
It's a simple example, and I can do it at this scale here on the UI.
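A rough sketch of what that flow looks like underneath, in Python with pandas and scikit-learn; the file name, the four techniques, and the 70/30 split are illustrative choices, not the actual AI Factory internals.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

df = pd.read_csv("wheel_wear.csv")                  # hypothetical file name
numeric = df.select_dtypes("number").columns.tolist()

techniques = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100),
    "gradient_boosting": GradientBoostingRegressor(),
    "knn": KNeighborsRegressor(),
}

results = {}
for output in numeric:                              # every numeric column is a candidate output
    X = df[[c for c in numeric if c != output]]
    y = df[output]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)
    scores = {}
    for name, model in techniques.items():
        model.fit(X_train, y_train)
        rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
        scores[name] = rmse
    results[output] = (min(scores, key=scores.get), scores)

for output, (best, scores) in results.items():
    print(f"best model for {output}: {best} (RMSE {scores[best]:.3f})")
```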
Now what if instead of
having six variables,
you actually had
10,000 variables?
Instead of building
24 models, you
want to build 10,000 models?
Or 100,000 models?
So we have a CLI
for that as well.
So let me switch to that and just walk you through this one.
I have 10 kube contexts running here. The first ones on the left hand side are GKE clusters, so GKE 1, 2, all the way to 5. On the right hand side are our self-managed Kubernetes clusters. So that's the tower you saw. The left hand side is GKE; the right hand side is self-managed.
And I'm just going to run a simple watch command. So I'm going to just watch the pods. And most of it is telling you that no resources are found. That means nothing is actually happening there, because I have not done anything yet.
Now let me go in here and run
the case where I'm actually
spinning up the
same thing, where
we are simulating 10
different users running
1,000 models each.
So it's going to spin up
10,000 models simultaneously
across all of these
different clusters.
So if I hit this, and cross my fingers--
What you're going to see is
different colors going through.
So each color is one
particular Kubernetes cluster.
So it's actually loading all
of the Kubernetes clusters
simultaneously.
And what you're going to
see is on the left hand
side, each of the
clusters actually starting
to get loaded.
And you can see some
of them started.
And it's processing.
Each of the nodes is going to process about 100 models.
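Underneath a run like this, the CLI is essentially fanning Jobs out across clusters. Here is a simplified sketch with the Kubernetes Python client; the context names, image, and simple round-robin placement are assumptions, since the real load balancer also watches cluster responsiveness.

```python
from kubernetes import client, config

GKE_CONTEXTS = [f"gke-cluster-{i}" for i in range(1, 6)]        # placeholder contexts
SELF_MANAGED = [f"self-managed-{i}" for i in range(1, 6)]

def submit_training_job(context, job_name, dataset_uri):
    """Submit one model-build Job to the cluster behind the given kube context."""
    config.load_kube_config(context=context)
    container = client.V1Container(
        name="trainer",
        image="gcr.io/my-project/model-trainer:latest",          # hypothetical image
        args=["--dataset", dataset_uri],
    )
    template = client.V1PodTemplateSpec(
        spec=client.V1PodSpec(containers=[container], restart_policy="Never"))
    job = client.V1Job(
        api_version="batch/v1", kind="Job",
        metadata=client.V1ObjectMeta(name=job_name),
        spec=client.V1JobSpec(template=template, backoff_limit=2))
    client.BatchV1Api().create_namespaced_job(namespace="default", body=job)

# 10 simulated users x 1,000 models each, spread round-robin across clusters.
contexts = GKE_CONTEXTS + SELF_MANAGED
for user in range(10):
    for model in range(1000):
        ctx = contexts[(user * 1000 + model) % len(contexts)]
        submit_training_job(ctx, f"user{user}-model{model}", "gs://my-bucket/data.csv")
```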
If I go back to our system
that's monitoring all this--
So what you're seeing are
all of the 10 clusters
along with the
management cluster.
And what you're seeing
is the blue dots
are the management cluster,
the 10 Kubernetes clusters
are around them, and the white dots that are popping up are the models that are being built live as I start building them.
And the total number
of services that
are running to
make this possible
is well over 1,000 services.
And the jobs-- you
can see the job count
is actually going up live.
So it looks like the demo gods
are finally smiling on us.
Now we did build another
case where 100,000 models
were running live.
And I wanted to show
that system as well.
And I'll come back and
show you these numbers.
And this is a case
where all 100,000
models ran and completed.
And I can go back to this case,
reduce the speed a little bit.
And adjust this size
also a little bit.
So this is what a system that builds 100,000 models looks like.
And this was done live.
[APPLAUSE]
Thank you.
Now if I go back
here, you can see
the models are
still being built.
They're still being submitted. And the interesting piece here is, to get to this level of scale, you can see all of them have been submitted, and Kubernetes is still responding and actually kicking off the jobs.
So the load balancer here not only had to watch the load on the system, it also had to be optimized to the level where it's watching which of these Kubernetes clusters are having a delay in responding back. Sometimes the Kubernetes API takes two seconds to give us a response back. In cases like that, we won't necessarily go load up that particular cluster. We'll move the load somewhere else.
So it's not just
that it's watching
what is happening
inside those clusters,
it's also watching Kubernetes.
Now you built all these models.
You want to know what
is the best model.
So you can run a query
across all of these models
that just got built. It
will come back and tell you
that it's not just one answer.
So you need to know whether it's the best model based on, say, root mean square error.
That was what we
showed on the UI.
But you can also ask for the best model based on max percentage error, or average percentage error, or any of the different percentage errors that you care about as a data scientist or a subject matter expert.
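In other words, "best" is a function of the metric you ask for. A tiny sketch of that query, with made-up model names and metric values:

```python
# Per-model metrics as they might come back from the query (illustrative values).
model_metrics = {
    "autoencoder_a":   {"rmse": 0.42, "max_pct_error": 11.0, "avg_pct_error": 2.1},
    "lstm_b":          {"rmse": 0.38, "max_pct_error": 19.5, "avg_pct_error": 1.8},
    "gaussian_proc_c": {"rmse": 0.51, "max_pct_error":  7.2, "avg_pct_error": 2.9},
}

def best_model(metrics, criterion):
    # "Best" depends on which error you care about, so the criterion is an argument.
    return min(metrics, key=lambda name: metrics[name][criterion])

for criterion in ("rmse", "max_pct_error", "avg_pct_error"):
    print(criterion, "->", best_model(model_metrics, criterion))
```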
And this is coming
live from the system
that is actually running there.
And you can see that jobs are
continuing to be submitted.
There you go.
So we did show you real world use cases that require a great deal of scale.
And getting to
scale requires you
to think slightly
differently, and also
think about scale from day one.
Thinking about scale
at day 10 or day 100
causes a lot more angst than
thinking about it on day one.
And we had to learn
that painfully
on many of our
applications as well.
But what we showed you today
was using Docker, Kubernetes,
Kubeflow, GCP, NGC
containers in a way that
is significantly scalable.
And it's definitely
scalable, but it takes
quite a bit of effort as well.
So I want to end
this with credits
from the team that's
listed here and many, many
more who worked tirelessly for
several years to get us here.
So thanks a lot, guys.
Thank you.
[MUSIC PLAYING]
