[MUSIC PLAYING]
ARUN SUBRAMANIYAN: OK.
Good afternoon.
Thanks for coming.
I'm Arun Subramaniyan.
I lead the data science
and analytics team
at Baker Hughes, a GE company.
We'll talk today about how we
deal with analytics at scale
for the industrial world.
Specifically for oil and
gas, but more in general,
what are the scalable
aspects and what
are the commonalities
of analytics
in the industrial world.
And just to give you a little bit of an introduction to the kinds of problems that we're interested in-- and by we, I mean the larger GE company.
So for those of
you who don't know,
every two seconds an
airplane takes off
somewhere in the world
with a GE engine on it.
About 30% of the
world's power is
generated by a GE turbine
of one kind or the other.
And we formed Baker Hughes, a GE company, last July by merging GE Oil and Gas with Baker Hughes, two companies which, by their own merits, were large [INAUDIBLE].
But the merger brought together the world's first full-stream oil and gas company.
By that, we mean
the first company
in the world that can
do exploration for oil,
extract oil, ship the oil,
refine it, and then do
something with the
refined products.
So make energy, make something
else with the products.
The first, and only, company in the world that can do that at scale.
Now, to do any of these
individual things by itself
is a big feat.
Doing all of these things
simultaneously in one company
requires us to build a
lot of models repeatedly.
And I've not even
touched upon what
we do in the health care space
or in the transportation space,
for example.
These are just simple examples.
And to do this at
scale, daily we
have to build
thousands of models.
And for us, models are
any representations
that connect a set of
inputs to a set of outputs.
Now, that can be as simple
as a simple equation,
or a full-blown partial
differential equation that
takes about two months to
solve on the world's fastest
supercomputer.
All of those things, for us, are models.
Now, to do this
effectively, we, of course,
need the raw material
for it, which is data.
And we deal with
data in two ways.
So we have what we call two big silos. The first silo is data. The second silo is analytics, or subject matter expertise.
I'll first tackle
the data silos.
And if you look across the entire industrial spectrum, the first one listed here, of course, is oil and gas.
From there to
aviation, say, take
an aircraft that takes off
from New York, lands in London.
The aircraft was designed
about 20 years ago.
The engines are not
collecting that much data.
That's roughly about a
terabyte of data generated.
Not much of it is stored,
not much of it is analyzed.
But today's newest aircraft
generate about a terabyte
per hour of flight.
Take that in the oil and gas
world, where an offshore oil
rig, for the past 30
years, has been generating
roughly 15,000 to 20,000
tags worth of data every two
seconds.
And almost all of
it has been stored.
Compared to the power or
transportation or the aviation
industry, the oil
and gas industry
has at least a couple
of orders of magnitude
more in terms of data,
primarily because they
had to deal with
uncertainty from the get-go.
There's very little you
know about the reservoirs.
There's very little you can
actually model or control
about what happens subsurface.
Whereas an engine,
it's complex, it's
extraordinarily
sophisticated, but you
control almost everything around
that particular [INAUDIBLE].
So the uncertainties exist,
but they're far smaller.
So the data sets that you
see in the oil and gas world
significantly outstrip anything
else in the industrial world.
But if you compare that to any
data sets in the ERP systems--
traditional ERP world-- for
the last 20 years, the last 30
years where you think about
any transactional data
sets, any kind of
maintenance records,
any kind of
information that you've
been sharing and
storing and retrieving
in the traditional
enterprise world
gets dwarfed by the industrial
data set by at least two
to five orders of
magnitude, both in terms
of volumes and velocities.
And there are no systems that
have been developed that would
tackle something at that scale,
even the modern industrial--
or the internet 2.0
kinds of technologies,
where we track everything that everybody's doing in this room, everything that everybody's doing anywhere in the world--
where they are,
where they're going,
what they're probably
going to do--
it still gets dwarfed
significantly.
And I'll give you some examples
of how we go about solving it,
but this is just one
side of the story.
The data silo is one
side of the story
that everybody talks about.
The other side of the
story is the analytic silo,
or the subject
matter expert silos,
where people have been
building analytics for decades.
People have been building
analytics for 50 years.
If we have the luxury
of going and building
a model from scratch, from
a data set that you still
have to prepare painstakingly,
that's a great thing.
You have a lot of frameworks
out there today to go build.
But if you had models that have been in production for decades-- that people understood and have been using in production-- the folks who are using them today may not necessarily understand all the nuances of the model.
You don't necessarily
want to go touch it.
You don't necessarily want
to go tweak it or change it.
You want to be careful about it.
The same thing in the financial
world, where you really
don't want to go change
the financial metrics
without understanding what went
on for 20 years internally.
Very, very similar things here.
The only difference is the
consequences of doing something
wrong can be catastrophic.
Imagine we have to decide, based on a model's prediction, whether we have to shut down an offshore oil rig which is producing about 2 million barrels a day.
Or how to predict whether a particular engine or a particular aircraft needs to be pulled for maintenance before this takeoff-- or is it OK to let it leave the airport today?
Consequences of a wrong
decision are catastrophic.
And false negatives
are really not OK.
So we generally don't build models where false negatives are OK. A false positive rate-- which is why it's generally high, roughly around 20% to 30%-- would be OK for a model in the industrial world.
But a false negative
rate, meaning
the risk of me saying something
is OK when it's really not OK,
has to be zero, or as
close to zero as possible.
That's where the
consequences come in.
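To make that asymmetry concrete, here is a minimal sketch-- with made-up scores and names, not from the talk-- of choosing a decision threshold so the false-negative rate is zero, while tolerating whatever false-positive rate falls out:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical anomaly scores: healthy assets score low, faulty ones high.
healthy_scores = rng.normal(0.3, 0.1, 1000)
faulty_scores = rng.normal(0.7, 0.1, 50)

# Zero false negatives: the alert threshold must sit below every score
# that a known fault has ever produced.
threshold = faulty_scores.min() - 1e-6

false_negative_rate = np.mean(faulty_scores <= threshold)
false_positive_rate = np.mean(healthy_scores > threshold)
print(f"FNR = {false_negative_rate:.2f}, FPR = {false_positive_rate:.2f}")
```

The price of driving the false-negative rate to zero is the elevated false-positive rate, which is why 20% to 30% false positives can be acceptable in this world.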
And analytic silos
exist because of that,
because it takes a lot
of expertise for somebody
to build and maintain
these models.
Now, for us to come
in and scale this,
if I have to
democratize analytics
for 5,000 engineers in a
company like Baker Hughes,
for example, I can't
necessarily go in
and say, drop
everything you're doing.
Here's the brand new framework.
Go rebuild all your models in the brand new framework.
Some would, but most
of them would still
go back to Excel spreadsheets and, in some cases, [INAUDIBLE].
But that's what works,
and they know it works.
So we need to somehow
bridge these silos together,
both the data silos
and analytic silos.
And fundamentally, a federated orchestration engine-- where you have a lot of these models talking to each other, not independent islands, but models that actually talk to each other-- is essential if you want to scale analytics in the industrial world.
And I'll show you
some examples of that.
The example that
you see here where
there is a Python model that
gives an output that gets
fed into a Fortran 77 model,
which we can only probably
run as a black box, that
then talks to a model that's
written in Go, that then talks
to a model that's written in C,
it's a daily reality for us.
It's not an esoteric example.
In fact, there are very, very few examples where we could build a model that gets you to an outcome in just one framework.
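The polyglot pipeline just described can be sketched as a tiny orchestration loop. Everything below is illustrative: in practice the Fortran 77 or C models would be wrapped as black-box subprocesses or called via ctypes; here they are stand-in Python functions so the sketch runs on its own.

```python
# Minimal sketch of a federated orchestration idea: each model, whatever
# language it is written in, sits behind a uniform callable interface.

def python_model(x):        # e.g. a feature-extraction step
    return x * 2.0

def fortran77_black_box(x): # stand-in for a legacy binary run as a black box
    return x + 10.0

def c_model(x):             # stand-in for a C library called via ctypes
    return x ** 0.5

pipeline = [python_model, fortran77_black_box, c_model]

def orchestrate(pipeline, x):
    """Feed each model's output into the next one."""
    for model in pipeline:
        x = model(x)
    return x

result = orchestrate(pipeline, 3.0)
print(result)  # (3*2 + 10) ** 0.5 = 4.0
```

The point of the uniform interface is that the orchestrator never needs to know which language a given model was written in.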
Now I've listed some
of these things.
Some of these things
could be provocative.
Some of these
things you could say
are myths or reality,
depending on your point of view
and what your
experiences have been.
But we've heard all
of these things.
And we talk to customers on a daily basis about where they are in their digitalization journey, where they are in their journey of moving to the Cloud or not, whether they believe in AI, or whether they think it's a myth.
A lot of these things are
what customers tell us daily.
Now, there is some truth to
a lot of these statements,
but they hide the
bigger truth, right?
So for example,
one of my favorites
is when people say neural
networks are old news.
The general statement is, I've
done this in the '90s, when
I was doing my master's.
They're really bad.
They got, what,
about 60% accuracy.
They really never got
anything meaningful
in engineering parlance,
so we moved on.
What's so new about all
these new neural networks
that are coming up?
And the other thing
I didn't list here
was people saying, oh, I've
done deep learning 20 years ago.
And for those of you who
do deep learning here,
you know, that's really
a silly statement.
But the reason why people
walk around with these myths
is because they haven't
gotten the outcomes
that everybody seems to be
promising with analytics.
And by the way, the
analytics promise
has been the same for
the last 25 years.
That used to be "the digital well of the future," which actually came out in the late '80s and early '90s, when they wanted to go sensor up the entire oilfield and say, this is going to create a huge revolution.
That's like me coming
and telling you,
email is going to make
you more productive.
So it just has to be balanced against what you're actually applying things to, and how.
And just to bring
home the point,
I wanted to walk you
through a few examples.
And one of the favorite things--
and just full disclosure,
right?
So my background is, through and
through, aerospace engineering.
I grew up in the physics world.
We did data science
as a necessity.
We got into doing large-scale
data science because physics
alone was not sufficient.
But the general pushback from most subject matter experts is: the world has been run by physics. We understand it.
That's the only thing
that can solve all kinds
of complicated problems, right?
So if you were to
predict, purely
from data, where a
particular stone would
land if you threw it, and
you don't know anything
about gravity, the only
thing you need to go by
is observations of each of the
stones that are being thrown.
If you had built that model purely from data, this is what it looks like.
So the projectile is going.
We don't know anything
more than saying
it's just a linear projection.
If this is the first
time I'm observing,
it takes me more than
10,000 observations to track
that the stone is even
going to come down.
I don't know anything
about gravity.
And even for the simplest problem, the most sophisticated model you can build requires a lot of observations.
On the flip side, if you're
a high school student
and you're given
the same problem--
see here, stopped working--
you will tell me that all the
data you need is just one data
point.
All you need is the
velocity, the speed,
and the angle at which
you threw the stone at.
Anywhere in the universe,
where there's gravity,
you can predict precisely
where the stone will land.
Right?
Even middle school physics
will tell you that.
The problem, really, is reality. If I didn't know exactly what angle I threw the stone at-- I just added 1% uncertainty in the initial angle-- my output changes by about 20%.
If I had a 5% uncertainty in
wind speed, all of a sudden you
have 100% uncertainty
where the stone will land.
And this is the simplest
physics problem you can tackle.
It's just one variable
and just very,
very little uncertainties.
Imagine if you
have to go predict
whether a particular engine
is going to have a fault,
or have to shut down
an offshore oil rig.
Massive problem,
thousands of variables,
lots of uncertainties, lots
of things you don't know.
You still have to
make a decision.
So physics alone can go significantly wrong.
In fact, in this problem, if
we add a few more uncertainties
it'll tell you the stone
is flying backwards.
So we have to be very
careful with that.
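The amplification the speaker describes can be reproduced in miniature with a Monte Carlo sketch (all numbers below are illustrative, and the exact amplification depends on the setup-- the talk quotes about 20% for its configuration): propagate a small relative uncertainty in the launch angle through the ideal-projectile range formula and look at the spread of outcomes.

```python
import numpy as np

g = 9.81          # gravitational acceleration, m/s^2
v0 = 30.0         # launch speed, m/s (illustrative)
angle_deg = 20.0  # launch angle, degrees (illustrative)

def landing_distance(v, theta_rad):
    # Ideal projectile range on flat ground, no drag.
    return v**2 * np.sin(2 * theta_rad) / g

# One data point suffices in the deterministic, high-school-physics world.
nominal = landing_distance(v0, np.radians(angle_deg))

# Now add a small relative uncertainty to the launch angle and propagate it.
rng = np.random.default_rng(42)
angles = np.radians(angle_deg * (1 + 0.01 * rng.standard_normal(100_000)))
samples = landing_distance(v0, angles)

rel_spread = (samples.max() - samples.min()) / nominal
print(f"nominal: {nominal:.1f} m, "
      f"relative spread under 1% angle noise: {rel_spread:.1%}")
```

Even in this one-variable toy problem, a 1% input uncertainty fans out into a several-percent spread in the answer; add a few more uncertain inputs and the spread compounds quickly.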
And the other thing
that people generally
don't like to talk about is the
fact that yes, data is cool.
It's everywhere.
Everybody likes to use it.
But the reality is not
much of it is useful.
So this is a real use
case where the data
set is from an actual asset.
Think of this as a
very large scale asset.
It could be a jet engine.
It could be a gas turbine.
It could be any of
those kinds of things.
We are the equipment
manufacturer.
We, meaning GE, [INAUDIBLE] GE.
We have all of the data sets
about how it was operated.
So the top left gives you the operating condition, one of the variables being measured for the entirety of this particular asset's life.
The actual data there
is in blue dots.
Everything else there is what
we estimated, or generated
through other sources.
The reason we had to do that was
because here in our databases,
systems of record, it says
the asset started operating
in 2001.
That's their
operating condition.
But it actually started
operating in 1994.
And there's lots of
things that happened
to it before our systems of
record actually caught up.
And if I had to go predict
anything about the system,
and I say, I only
go by the blue dots,
it doesn't matter how
sophisticated my techniques
are, how sophisticated my
data scientists are, or even
the physics-based
modelers are, the models
would still be wrong because
data is only about 60% correct.
So we had to go correct for it.
And as we started digging into this over the last decade or so, we realized that missing data is not all made equal. We actually had to go invent different ways of categorizing missing data. So that's why you have partially missing one, two-- fully missing one, two, three, and things like that. And each one has to be estimated using different techniques.
So imagine the data scale we started with.
That includes the missing data.
If we add on the augmentation
activity that goes on
to add the missing data
back in, that problem
becomes two or three times
worse in terms of volumes.
Because the minute
you go add data in,
and if I change the color
coding to all be just blue,
it would be very hard for you
to tell which was estimated
and which was real.
And you've got to figure out
which models would use which
version of the data
sets that you're using,
and how do you keep
them all in sync?
That becomes a massive
problem as well.
And just to show you the quality
of how well we can actually
estimate, the top right
figure, blue histogram,
tells you the actual data set.
The mean is around 559 with a
standard deviation of about 33.
The estimated values are very,
very close, 560 versus 31.
So it's very hard for
someone to tell after a while
whether the data set was real
or not if you didn't properly
tag them.
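A minimal sketch of the tagging problem, on simulated data (using the mean and standard deviation quoted in the talk as illustrative targets): impute a missing stretch by sampling from a distribution fitted to the observed period, and note how close the summary statistics land.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical sensor tag: the talk quotes a real channel with mean ~559
# and standard deviation ~33; we simulate one here for illustration.
observed = rng.normal(559, 33, 5000)

# Simplest possible imputation for a fully-missing stretch: sample from a
# distribution fitted to the observed period. (Real pipelines would use
# much richer models, chosen per missing-data category.)
mu, sigma = observed.mean(), observed.std()
imputed = rng.normal(mu, sigma, 2000)

print(f"observed: mean={observed.mean():.0f}, std={observed.std():.0f}")
print(f"imputed:  mean={imputed.mean():.0f}, std={imputed.std():.0f}")
# Untagged, the two populations are statistically hard to tell apart --
# which is exactly why estimated points must be tracked separately.
```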
And this happens regularly. This happens on an engineer's desk, where they will change something in the data set and pass the data set on to somebody else, who builds a model that has a significant consequence, and they pass it off. And there are no systems of record for any of this. And that becomes the actual data of record moving forward as well.
So very quickly, industrial
analytics, for us,
is a combination
of three things.
People have been doing analytics
for a long time combining
domain knowledge with
standard software techniques.
The new kids on the block are a new way of doing data science, mainly deep-learning-based data science, and a new way of building software on the cloud: microservices, self-scalable.
The traditional solutions are a
combination of domain knowledge
with software.
I would say finding solutions to very, very hard problems, like finding a needle in a haystack, is a combination of modern data science with modern software.
Predicting what I'm going to
buy before I actually go and buy
it, hard problem.
Predicting how long it will
take me to go home before
I actually tell you
when I'm going to go
home, again, a hard problem.
In a lot of cases,
it's been solved.
A lot of cases, it's
already happening today.
The case in point: industrial analytics is really a combination of all three things.
Because it's not even finding
a needle in a haystack,
it's finding a needle
in an ocean of data.
Because most of the time, the one thing that you're trying to go predict, you don't have any data for. So if you're trying to predict, for example, a [INAUDIBLE] with a bearing failing, you would have hundreds of sensors on a gas turbine, but you don't have a sensor on a bearing that's about to fail.
Many times, the
sensors that you have
are all confounded and
not necessarily calibrated
in the last 20 years.
So how exactly do you make sense of it? How do you predict the thing you're trying to predict without understanding the sensors around that particular piece?
That's really the problem that
the industrial world faces.
And that is not just one
esoteric problem, not just
one problem here [INAUDIBLE].
Almost every problem of
value is a combination
of all three things.
And we got here, not as an
epiphany moment where we just
sat there and said
this is the hardest
problem we need to solve.
These are the problems we
have to solve every day.
These are the
problems our customers
have to solve every day.
If we could solve it with
traditional techniques
we would.
If we could solve with just
purely with data science,
we would.
Because that's the
easiest thing to do.
There's a lot of
techniques out there.
There are a lot of
systems out there
that'll let you go do that.
The reason we had to go combine all three things painstakingly is because that's the only thing that actually gets us to a solution that's meaningful.
Now we talk about
digital twins a lot.
And the world talks
about digital twins a lot
today to the extent that
it's almost an abused term.
For us, digital twins
are very special.
We get there because we
can't solve these problems
with any of these individual
techniques by themselves.
And the way we define it is: a digital twin is a digital representation of a physical system, an asset, or a process that has a few characteristics. It has to be live, it has to be continuously updatable, and it has to be adaptable and scalable, all at the same time.
Now, imagine I have to
do something specific,
or I have to make something
specific and scalable
at the same time.
Meaning, I go build a
model of a particular asset
or a particular system.
I make it so specific
that it's specific
to that particular system.
It doesn't scale by definition.
So how do you go build
something that's both specific,
and scalable?
Even worse, I made it specific.
OK, I can somehow
make it scalable.
I now have to go
make it adaptable.
Meaning, if I go
change something
in the physical system,
I change a pump out,
I change a compressor
out, I should
be able to change
only that portion
of my entire digital twin
and make the whole thing
work again.
I don't have to go back and
rebuild the whole thing.
Those are very, very
hard things to do.
And the reason we
have to go do it,
is because we have to run
digital twins thousands
of times a day for every
single implementation we
have out there.
So it's hundreds of thousands
of digital twins running
thousands of times a day.
So we can't really
run these at scale
if we have to go do
these things manually.
Typically, the way these things are done, you get a team of six to 10 PhDs with some designers who are experts in the system, let them loose for about six months, and they come up with one model that's specific, but not scalable and not adaptable.
And the reason why we have to go mix all these three things together: we always start with physics-- a physics-based core so that we can actually build on top of it-- where physics allows us to handle the known knowns. So if you know something is going to be a problem, physics will get you to a precise solution very quickly. Probabilistics, because that's where we started: without handling uncertainties, I really cannot predict the real world.
Deep learning allows us to
really sleep reasonably well
at night because it
allows us to go and deploy
these systems to
watch for things
that I know are
going to go wrong,
but I don't know what they are.
I know something is
going to go wrong.
That's all I know.
I don't know where.
I don't know when.
So those systems allow us to
go watch and adapt the models
for that.
So I'll give you a
few examples of that.
I'll make it a little
concrete for you
for what a digital twin is.
If somebody comes and tells you they have a digital twin of a well-- and in this particular case, this is a well with an artificial lift system; think of that as a pump. It's an electric submersible pump that's about two miles deep under the earth. When we say it's a digital twin of a well, we connect a reservoir model-- that's the IPR curve-- to a pipe model, to a pump model, to a tubing model that finally gets you a pressure differential at the end.
And what you're
trying to predict
is how do I change the
operating conditions
of the pump and the well to
make sure that I produce as much
as possible without
damaging anything,
both the well as
well as the pump.
And to do this, roughly
one digital twin
has about eight models
running in sequence
along with an iterative
loop internally.
So it's not just
a linear sequence.
And this, we would consider
a fairly moderate level
of complexity digital twin.
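A toy sketch of the chained-models-with-an-inner-loop structure: an inflow curve, a pump model, and a tubing model are iterated to a consistent operating point. The equations and constants below are invented for illustration-- not real well physics.

```python
# Toy digital twin: three chained models must agree on one operating flow.

def inflow_pressure(q):      # reservoir side: available pressure drops with flow
    return 200.0 - 0.5 * q

def pump_boost(q, speed):    # pump: added head, falling off with flow
    return speed * (50.0 - 0.2 * q)

def tubing_demand(q):        # tubing: required pressure rises with flow
    return 100.0 + 0.8 * q

def solve_operating_point(speed, q0=50.0, iters=200, relax=0.1):
    """Inner iterative loop: adjust flow until supply meets demand."""
    q = q0
    for _ in range(iters):
        supply = inflow_pressure(q) + pump_boost(q, speed)
        demand = tubing_demand(q)
        q += relax * (supply - demand)   # surplus pressure -> more flow
    return q

q = solve_operating_point(speed=1.0)
print(f"operating flow: {q:.1f}")  # converges to 100.0 for these made-up curves
```

Swapping the pump out in the physical system would mean replacing only `pump_boost` here-- which is the adaptability property the talk asks for.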
Now the way we use it is an
operator would go in and say,
OK, how do I improve
my production
from my set of wells?
To answer that
question on one well,
I typically have to
run a digital twin
at least a few thousand times.
And to get a plot
like this, where
I have a plot of power
versus total liquid flow
from that particular well by
changing all the parameters,
all the gray dots, each
one of those gray dots
is a prediction, or an inference
through the digital twin.
The best family of solutions
are on the blue dots.
The problem is, today I might be
operating on the black diamond.
Without knowing what my space
looks like, I might just
randomly increase power and
hit any one of those red dots
on the top where I
increase power, meaning I'm
spending more, and I'm
actually producing less.
This happens a lot: intuitively you can go tweak some things, but you really don't know where you are in your design space.
In this case, it's a
fairly simple problem.
We're only changing
about six parameters.
But if you're in an oil field,
there are a few hundred wells,
you're automatically going
into a few thousand parameter
optimization problem
very quickly.
And that needs to happen
a couple of times a day.
But today, the best you
can do with the tools
out there, not just
in oil and gas,
but in aviation and power or
any one of these industries,
is to run an optimization
like this, not even
a global optimizer,
but a local optimizer.
It can take anywhere
from 20 minutes
to about two hours for an
advanced engineer to go run it.
So we took this
as an experiment.
And about summer of last year we
went and built the digital twin
first for a single well.
It took about three
minutes per well.
We scaled it with the techniques we talked about up to now: combining physics, data science, and scalable software on the cloud.
In about six months, we got a 1,000x improvement, going from a single-well optimization that took about three minutes to run, to a cluster of about 20 wells that you can optimize in about 30 seconds.
The reason why you get a 1,000x improvement is because it's not a one-shot optimization. It's a game of whack-a-mole.
So you go do something
else in one well,
something else goes
wrong in another well.
So you fix, go somewhere
else, change something,
something else goes wrong.
So you do that iteratively
multiple times.
In fact, to get to that solution, you need to run the model about 20,000 times to get to that answer.
That's really where
you're getting
the 1,000 x improvement.
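The gray-dots/blue-dots picture from the talk can be sketched as follows, with a made-up stand-in for the twin (a real twin chains about eight models): sweep thousands of settings, then keep the Pareto-efficient ones.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for one digital-twin inference: map pump settings to
# (power drawn, liquid flow produced). This quadratic is purely
# illustrative, not a real well model.
def twin(freq, choke):
    flow = 40 * freq - 5 * freq**2 + 10 * choke
    power = 15 * freq + 2 * choke**2
    return power, flow

# Thousands of inferences = the cloud of gray dots in the talk's plot.
freqs = rng.uniform(0.5, 6.0, 5000)
chokes = rng.uniform(0.0, 5.0, 5000)
power, flow = twin(freqs, chokes)

# The "blue dots": Pareto-efficient settings -- no other setting gives
# more flow for less power.
order = np.argsort(power)
f_sorted = flow[order]
record = np.maximum.accumulate(f_sorted)
pareto = f_sorted >= record  # points that set a new flow record as power grows
print(f"{pareto.sum()} Pareto-efficient settings out of {len(freqs)}")
```

An operator sitting at the "black diamond" can then move along the efficient frontier instead of blindly increasing power and landing on one of the red dots.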
But we wanted to push it
one level higher saying, OK,
we keep the models the same.
We scale the software.
We push it to the Cloud.
We actually orchestrate
all these models
so they're on the
Cloud and see what
actually can be done in a
field-level optimization.
The solution
doesn't exist today.
And to be honest, none of our
customers even asked us for it.
In fact, when we went and presented the 20-well optimization, the response was, this is great.
Today, we do that
once every six months.
This one might let us do
it once every two weeks.
And the field-level
optimization, the response
was, we really did not know
that this is even possible.
And their response was actually the opposite.
They came back and said,
can you do 5,000 wells?
Can you do 9,000
wells simultaneously?
Because the pace at
which this is moving,
the six month deadline
was just because we
had our internal conferences.
We actually had it scaled
in about three months.
I'll show you how we
walked through that.
And in just a year,
for an engineer
to go from maybe doing a
single-well optimization
every six months to doing
a field-level optimization
a couple of times a day is
a phenomenal improvement.
[APPLAUSE]
OK.
Now, on the notion of using deep learning in traditional industries like oil and gas, or aviation, or power: they're very risk averse. They don't really want to trust the model by itself. So here are some of the things that people generally talk about with anomaly detection.
But the thing that resonates with industrial customers is that people don't really want to go set rules. Most anomaly detection systems out there want you to come and say: here are my thresholds for a set of variables. Set my thresholds, and after that, I'll go tell you when something is going to go wrong.
And this has to be done a priori
before it goes and catches
things.
The biggest promise of
automated anomaly detection
with deep learning primarily
is you can actually
catch anomalies without
necessarily knowing what
these anomalies are beforehand.
I don't have to
train for anomalies.
I don't have to train rules.
In fact, I don't
even need to tell it
what are the sets of inputs
I need to be looking at.
As long as you have a data stream coming in, and as long as you know what the normal state was, I can go say what was abnormal.
That is the first foray
for us into deep learning.
But from there, we actually
went into deploying
a live application for a
customer where the customer was
recording about 10,000 sensors
off an offshore oil rig.
Actually, this was
done by the customer
who actually already pushed
the data to Google Cloud.
All we had to do was train with a six-month data set they had, which was normal, with no abnormalities. And starting from scratch-- going from not knowing what the data is, not even knowing what the asset was, to having anomaly detection working in an application-- took us roughly six weeks.
And this particular application that you see here-- this failure mode that was caught, shown in the top right as just an anomaly score-- uses a combination of about 16 different sensors to record an anomaly.
Then the next immediate
question is, OK, this
is something going wrong,
but is this just a blip,
or can you tell me what
is causing this problem?
That's the plot on the bottom right.
And I'm not sure if it's
showing it properly here,
but the colors actually
indicate as a function
of time, which
variables are actually
affecting the anomaly score.
So you can actually start
going and figuring out
what is causing the anomaly.
That's the next step.
The step after that is
actually figuring out
what is the actual cause, what
is the root cause, actually.
And to do that, you actually train with maintenance logs along with these time series data sets as well.
So this is actually
an autoencoder
running for a set of about
24 different variables
over a period of two years.
Built, running, live on GCP in about six weeks.
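A compact stand-in for the approach described: learn the normal correlation structure of the sensors, then score new points by reconstruction error. The deployed system used an autoencoder; this sketch uses PCA-- which is exactly a linear autoencoder-- on simulated sensor channels, so it runs on its own.

```python
import numpy as np

rng = np.random.default_rng(3)

# "Normal" training period: 24 correlated sensor channels (as in the
# talk), simulated here as a low-dimensional latent process plus noise.
latent = rng.standard_normal((5000, 3))
mixing = rng.standard_normal((3, 24))
normal = latent @ mixing + 0.05 * rng.standard_normal((5000, 24))

mean = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
encoder = vt[:3]  # directions that explain normal behavior

def anomaly_score(x):
    """Reconstruction error: how badly 'normal' structure explains x."""
    z = (x - mean) @ encoder.T
    recon = z @ encoder + mean
    return np.linalg.norm(x - recon, axis=-1)

# A point drawn from the normal pattern scores low; a point that breaks
# the learned sensor correlations scores high -- no per-variable rules
# or thresholds set up front.
ok = latent[:1] @ mixing + 0.05 * rng.standard_normal((1, 24))
odd = ok + np.eye(24)[5] * 4.0   # one sensor drifts off on its own
print(anomaly_score(ok), anomaly_score(odd))
```

The per-sensor contribution to the residual is also what lets you color the bottom-right plot: which variables are driving the anomaly score at each point in time.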
The other kind of use case where deep learning is actually fairly effective is image recognition.
So this is a use case where, if you've driven from here to LA, you would have seen a lot of these pumps.
These are called rod lift pumps.
And the biggest problem with rod lift pumps is that they fail quite often. And their failure modes generally are categorized using these images, which are plots of the force versus the displacement of the rod that keeps going in and out.
And depending on how
that shape looks,
an experienced
engineer can say what
is the actual failure mode.
The problem is, two failure modes can look more or less the same. So it depends on who you ask, and when you ask, in terms of finding out what exactly has failed, until you actually have to stop operations and pull it back up.
Whereas if you use traditional techniques with distance-based metrics-- this is the overall graph; tell me what this new graph looks like, and what it is closest to-- you get to an accuracy of about 60%.
So we took the latest and best image object recognition model, augmented it using transfer learning with just the failure mode images that we had, and let it loose.
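For contrast, the roughly-60% traditional baseline the talk mentions amounts to nearest-template matching on the card shapes. Here is a sketch with invented card shapes, where the two failure-mode templates are deliberately near-identical-- the talk's core difficulty:

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy dynamometer "cards": force-vs-displacement traces sampled at 64
# points. All shapes and mode names below are invented for illustration.
t = np.linspace(0, 2 * np.pi, 64)
templates = {
    "healthy": np.sin(t),
    "mode_a": np.sin(t) * (1 - 0.30 * (t > np.pi)),
    "mode_b": np.sin(t) * (1 - 0.35 * (t > np.pi)),  # nearly identical to mode_a
}

def classify(card):
    """Traditional baseline: nearest template by Euclidean distance."""
    dists = {name: np.linalg.norm(card - ref) for name, ref in templates.items()}
    return min(dists, key=dists.get)

# A noisy mode_b card is easily mistaken for mode_a, which is why this
# style of matching topped out around 60% accuracy before the
# transfer-learning approach.
noisy_card = templates["mode_b"] + 0.2 * rng.standard_normal(64)
print(classify(noisy_card))
```

A learned classifier can pick up subtle shape cues that a global distance metric washes out-- which is where the jump from ~60% to ~93% came from.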
So if I run this demo for you, you can see most of the detections. On the left-hand side is the data set that's coming in for a particular pump, for a particular model of pump. On the right-hand side is what the model is predicting in real time. So this is a deep learning model built and deployed for inference-- in this particular case it's actually running on an edge device with a small [? Jetson, ?] at the timing that you're actually seeing.
So you can see
instantly it is catching
with under one second delay
what the actual failure mode is.
And this model was really
not that optimized yet.
We just took the standard model
that was published, augmented
it with the data sets
we had, and accuracy
went from about 65% to 93%.
The data sets were not even that big-- only a few hundred failure-mode images. And if you optimize the models, you can go from about 93% to about 97%.
The last 3% is
something that you have
to go collect more data for.
So these are some of the examples that we are building and deploying in the field, both with pure deep learning and by augmenting traditional techniques with deep learning.
The other kind of
deep learning models
that we are building
quite effectively
are the knowledge systems.
And the request for this one
came from a couple of customers
where the request was, we
have a lot of documents
that we have generated over
the last 30 to 50 years.
We have a lot of
training material
that we generally use
for our new recruits
to be trained on our
operations and to make sure
that we can actually
get them up to speed
very quickly with our
subject matter experts.
What most of these industries are facing today is a dichotomy of problems.
The first problem
is that they're
losing a significant portion
of their experienced workforce
in the next five to 10 years.
On the other side, the new
workforce that's coming in
are used to learning
and interacting
in a very, very different way.
They don't necessarily
want to sit
through three weeks of training
on just going through and doing
one task.
They want to be
able to go learn.
They want to be able to go
search and find information
as they need it.
The problem is, to
enable that and to bridge
the gap between people
who are retiring
and the people
who are coming in,
you need a knowledge system that
can quickly learn and adapt.
So in this case, we built
up the knowledge system.
So we had the standard
English language NLP models.
We took those and we trained
those models to understand
technical English.
So how do you go
interpret tables?
How do you go interpret
figure captions?
How do you get a
natural language query
to return an answer that's not
a pure retrieval, but actually
something that generates text?
So we had to go build that.
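As a toy illustration of the retrieval side of such a system (the actual models described here go beyond retrieval to text generation), the sketch below ranks a few invented technical documents against a query with TF-IDF-weighted term overlap. The corpus and document names are made up for illustration:

```python
import math
from collections import Counter

# Toy corpus standing in for decades of technical documents (invented text).
DOCS = {
    "doc1": "gamma ray logging tool calibration procedure for wellbore surveys",
    "doc2": "turbine blade inspection schedule and maintenance intervals",
    "doc3": "drilling fluid viscosity table and pressure figure captions",
}

def rank(query):
    """Score documents by TF-IDF-weighted overlap with the query terms."""
    tokenized = {d: t.split() for d, t in DOCS.items()}
    n = len(DOCS)

    def idf(term):
        # Rarer terms carry more weight; smoothed to avoid division by zero.
        df = sum(term in toks for toks in tokenized.values())
        return math.log((n + 1) / (df + 1)) + 1

    q_terms = query.lower().split()
    scores = {}
    for d, toks in tokenized.items():
        tf = Counter(toks)
        scores[d] = sum(tf[t] * idf(t) for t in q_terms)
    return sorted(scores, key=scores.get, reverse=True)

print(rank("gamma ray calibration"))  # doc1 ranks first
```

A generative system would then condition its answer on the top-ranked passages instead of returning them verbatim.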
And after we built
that, we realized
that the corpus of data
we used to build it
had to be expanded with all
the subject matter expertise
we had internally
in Baker Hughes, GE.
So we made an
English-language-speaking model
understand the
technical language
based on all the technical
language documents that
are out there in the world.
Then we made that model
into an oil and gas person.
Actually, we made it
into a BH-GE person,
because it understood
our language.
Then we took that model and
tried to apply it to a customer
and realized it was like
a BH-GE person talking
to a person in BP or Shell.
We can understand each
other, but we don't know
each other's lingo exactly.
We have our own acronyms.
So we had to go get the
documents from the customers
to augment this model as well.
So just to get to a system
that understands a natural
language query and responds
to it in the customer's context:
for example, a user input
goes in, gets a spelling
correction that's oil and gas
specific and customer
specific, and then the system
gives recommendations based
on what answers could be given.
Because if you go to Google
or any of the search engines
and ask a general question,
you're going to get a very,
very generic answer.
What they really
want is an answer
that's specific to
the person who's
asking the question, in the
context of the company
where it's being asked.
So take this
one simple example,
where somebody went
and tried typing in "gammar."
If you go try
that in Google today,
it'll get corrected to
"grammar" or "grammarly" or any
of the many English
language corrections.
What they actually wanted was
a "gamma board" or a "HT gamma
elite."
That is specific to that
particular customer.
They also wanted "grammarly"
and other things,
but they would be
ranked down, much
below what a general purpose
search engine would give you.
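A minimal sketch of this kind of domain-aware spelling correction, assuming a tiny invented vocabulary and a simple similarity-plus-boost score (the production system is of course far richer):

```python
import difflib

# Hypothetical vocabularies -- the real term lists are customer-specific.
GENERAL_VOCAB = ["grammar", "grammarly", "gamma"]
DOMAIN_VOCAB = ["gamma board", "HT gamma elite"]  # oil-and-gas / customer terms

def suggest(query, domain_boost=0.3, n=4):
    """Rank spelling suggestions, boosting domain-specific terms."""
    scored = []
    for term in GENERAL_VOCAB + DOMAIN_VOCAB:
        sim = difflib.SequenceMatcher(None, query.lower(), term.lower()).ratio()
        if term in DOMAIN_VOCAB:
            sim += domain_boost  # rank customer terms above generic corrections
        scored.append((sim, term))
    return [term for _, term in sorted(scored, reverse=True)[:n]]

print(suggest("gammar"))  # 'gamma board' ranks first, 'grammar' still present
```

The general-English corrections still appear in the list, just ranked below the domain terms, which matches the behavior described above.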
Imagine extending this
out to where you just
start entering natural
language queries and an actual
full-blown report
gets generated
for you to train yourself
on the query that you entered.
That's really where the
oil and gas language
model is getting used.
In fact, one of
the models we built
that required
significant compute
is the oil and gas
language model.
And we continue to
evolve that as we speak.
We started, of
course, with English.
But because we are an
international company,
and because the corpus
of knowledge
about the industrial
world is not just in English,
we are starting
to build translators
from English into all kinds
of other languages as well.
OK.
The other thing we
wanted to talk about
was democratizing analytics.
Now we talked about data silos,
we talked about analytics,
or subject matter expert
silos, but there's
also the other problem that
we constantly come across.
Hey, this is working on
my laptop or on my system.
I've shipped it to
DevOps, or a CloudOps,
or another analytics person.
Now it's your problem to run it.
It just works fine on my system.
Containers solve
that significantly.
But we had to upskill or
up-level our analytics folks,
because they come, not just
from software or programming
backgrounds, but they come from
deep engineering, and physics,
and domain expertise
backgrounds.
So we just did a quick
check of how long it took us
to onboard somebody.
So usually, based on
a person's background,
it would take anywhere from
a week to a couple of months
to fully onboard somebody
to be productive in building
new analytics and pushing it
into our production systems.
And that was before containers,
before we had CI/CD pipelines.
Whether on the cloud or
on somebody's laptop,
it was more or less
the same story.
So it's not just about
scaling up, it's also
about scaling down.
How do you downscale
your overall system
so that somebody can do their
work in their small ecosystem
and then push it back into
the overall systems as well?
So we went from somebody taking
anywhere from a week to two
months to a couple of hours.
And this was tested
on a variety of people,
from those with a lot
of expertise with containers
to those with almost none.
In fact, we had people who came
in with absolutely no expertise
in containers or
programming who still
got up and running in the first
day to push production code.
The other one was that
we went through
painful quarterly releases.
It used to be nerve-racking,
sitting there right before
midnight, because we had
already promised our leadership
that we were going
to release that day.
And something that was working
just fine the day before,
or even in the morning,
would break just 10 minutes
before production.
We've all been there.
And we went from those painful,
painful quarterly releases
to today, where we can
actually do continuous releases:
a person who has built
analytics, who does not
necessarily understand the
overall working mechanics
of a large-scale
enterprise software system,
can push something, make
it into production,
and actually be OK
pushing it into production.
And we can do that in
some specific cases.
Of course, I'm not claiming
we do that in all cases.
But the capability
for the analytics person
to write code, and for the same
code to be shipped to production,
is fundamental if we are
to scale any of this.
Now I'll quickly run
you through a demo.
The demo I'm going
to run through
is trying to build a few
thousand models rather quickly
on GCP.
And to enable this, we built
almost all of our services
using Cloud native
Google services.
And we have an internal
tool that we also
use as an analytics
engine for all
of our products for the
customers called Applied AI.
We are running Applied
AI on GCP core services.
So it's our core running
on top of GCP's core services.
So we are using the
Kubernetes engine.
We are using all
of the storage services
that Google provides natively,
including the perimeter
security, Cloud DNS,
firewall rules, and everything.
OK.
And hopefully the demo
gods are smiling on us.
Let's see if this works.
OK.
Can we switch?
OK, perfect.
So what you're
going to see here,
you can see we have
a Kubernetes engine.
We have the cluster running
that's fairly large.
Now we also have both a
browser-based interface
that we won't be demoing today
for citizen data scientists,
or subject matter experts, to
go in and build a lot of models.
But we have a CLI that
a more advanced person
can use to build a lot of
models very, very quickly.
So if I do my CLI on
the left-hand side,
you can see I can
do a lot of things.
I can run, log in,
run a few models.
I can check the
status of the models.
But we can also do guided build.
And guided build is
when I don't necessarily
know much about my problem.
I was given a data
set by somebody.
I need to go build a lot of
models with that data set.
I don't necessarily know
what models I need to build.
I don't know what techniques
are going to be useful.
I don't know if there is a
lot of missing data in it.
I'm not given much information
about that data set.
And that's a very
standard use case
as well, where somebody
gets handed a data set
and told, go do something.
Or there might be an
outcome, but they may not
say how to get to an outcome.
So about 60% to 80% of
a subject matter expert's time
goes into figuring out, what
do I do with this data set?
So this is trying to
figure out how do you
reduce that 60% to 80% of
your time to maybe 20% to 30%,
rather than spending 80%
of your time just wallowing
in the dark.
Now we already
logged into the CLI.
And if I go into our
actual Kubernetes engine--
as usual, the token has expired.
Let me make sure I get my token.
OK.
So we've extended Kubernetes
to have a graph tab that
will tell you what
services you are running,
how many replicas
you're running,
what pods are running,
what tasks are running.
And we had to do this because
without it we really
couldn't go in and
figure out how to debug.
Not only how to debug,
but what scale
we are really running at.
Digging through the raw
numbers and logs takes
phenomenally longer than
looking at something
like this and getting
to an answer quickly.
Now these are all the services
that are running already.
These are the core services
that you saw in the architecture
diagram, where this is just the
Applied AI running and waiting
for somebody to
issue any commands.
And what you see here
in this big flower
is all our task services.
So we have 50 services running,
waiting for any of the tasks
to come in.
If I go in here and
say, AI guided build,
I'm saying, use a local file,
and, out of all
the different combinations
of inputs you can use, use the
top 10 inputs that the model
automatically selects as
the most important,
as a first pass.
So this file, the data
file, has 1,000 variables.
They're all correlated
with each other.
That's a real data set.
And you, as the data scientist
or the subject matter expert,
would normally have to go in
and figure out which variables
are important before going
and building a model.
What I'm asking is, go
build all possible models
for all possible
combinations of variables,
selecting the first 10
important variables per model.
That's what this
command would do.
Hit Enter.
What it's doing
is it's going off,
and in this particular
case, it has 200 variables.
It's running 1,056 total models.
And you can see it's pushing
off all of those task runs
to the Kubernetes engine.
And as soon as the
Kubernetes engine
gets all of these tasks
launched, it sends
the status back.
And you can see the
graph already updating.
It's actually faster
than it looks on screen here,
but each of the colored
nodes that is coming up
is an actual model
that is getting spun up.
And as we are speaking, we've
already spun up 693 models,
and it's live
updating as we speak,
pushing everything out
to GKE live off
of the conference Wi-Fi here.
So as we are speaking,
we just took a data set,
pushed it out without knowing
what is actually in there,
and asked the system
to go and figure out:
how many models can you build?
What are the combinations
of inputs that will actually
work for each of those models?
Go build them, and come back
and show me they are done.
And all of this is running
live on Google Cloud.
And in this
particular case, it's
now already finished
building about 730 models.
And the responses that
you're getting back
on the screen there
are coming live
from all of these containers
getting spawned
inside the Kubernetes
engine, with all
of the connections between
the different servers
running in the cloud.
Now can we switch back to the--
so what just happened
here was this: we took
this data set with,
in this case, 200
different variables, with a
combination of correlations
between them. In the
plot you see here,
each color is a correlation
between one variable
and another variable.
The system went in, looked
at the correlations,
looked at a lot of the other
heuristics in the data set,
figured out what models
to build with a variety
of different techniques,
ran the models,
and came back with
an answer for you.
Now, to run at this scale,
what ran live in a few
minutes was 50 nodes
running 4,800 CPUs
with about 30 terabytes of RAM.
Yes, you just spun it up
for about five minutes,
and it came back down.
But you saved an engineer
maybe two weeks' worth of work.
And they can actually
go do work on what
is more meaningful for them
to go ask questions and get
answers from, rather
than figuring out
what is this data set
actually telling them.
And we have a lot of
workflows built in internally
for us to get that same
acceleration of analytics
internally as well.
Now we did run this, not on
Wi-Fi, but actually [INAUDIBLE]
with a proper
internet connection.
We could run 30,000
models live, loading up
the Kubernetes cluster that I
just showed you with about 60%
utilization.
OK.
And the graph actually
looks like that.
It went live in about
2 and 1/2 minutes,
spinning out all
30,000 models [INAUDIBLE].
The nice thing about this is
not all 30,000 actually succeed.
A lot of times when you try
to push 30,000 models through
a cluster of this
size, a lot of those things
fail and have to be
restarted automatically.
And having GKE, having the Cloud
native services automatically
restart helps us
tremendously as well.
We didn't have to really worry
about the infrastructure at all
in this particular case.
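GKE's controllers handle that restart loop for us; locally, the same fan-out-with-automatic-retry pattern can be sketched with a thread pool. The failure model here is invented for illustration (every fifth task fails once, then succeeds, standing in for transient pod crashes):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def train_model(task_id, attempt):
    """Stand-in for one containerized training task.
    Simulates a transient failure: every fifth task crashes on its first try."""
    if attempt == 0 and task_id % 5 == 0:
        raise RuntimeError(f"task {task_id} crashed")
    return task_id

def run_with_retries(task_ids, max_attempts=3, workers=16):
    """Fan tasks out to a worker pool; resubmit failures, like pod restarts."""
    done, pending = [], list(task_ids)
    for attempt in range(max_attempts):
        if not pending:
            break
        failed = []
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = {pool.submit(train_model, t, attempt): t for t in pending}
            for f in as_completed(futures):
                try:
                    done.append(f.result())
                except RuntimeError:
                    failed.append(futures[f])  # retry on the next pass
        pending = failed
    return done

results = run_with_retries(range(100))
print(len(results))  # all 100 tasks eventually complete
```

The point of leaning on the cloud-native services is exactly that this bookkeeping, tracking which of 30,000 tasks died and resubmitting them, never has to be written by the analytics team at all.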
So I just want to leave
you with two thoughts.
One is, we're not
experimenting here.
We're not really saying this
is an abstract thought maybe
that will be available in
the future in five years time
or something like that.
We're saying industry
is using this today.
A lot of these
applications may not
seem like traditional deep
learning applications,
traditional cloud applications,
but a lot of these applications
are moving to the cloud
as we speak today.
And a lot of industries
are actually embracing it,
not necessarily resisting it.
And challenges still do exist.
But the biggest challenge,
of course, is accessibility.
So everybody says they
have a lot of data.
But when you go in
and say, can you
give me the data sets,
that becomes a six-month
to one-year exercise.
Because internally,
they have to go figure
out how to go get the data, who
has access to the data sets.
And the biggest
hurdle today really
is not even the
data sets or the talent.
It's really about
changing mindsets.
Everybody wants to do
something with analytics.
Everybody wants to do
something with cloud.
Everybody understands that it's
going to be valuable to them.
It's about how they
translate that value
into their actual context.
That's the biggest hurdle today.
So I'll leave you with that.
Thanks a lot.
Thank you.
Thank you.
[MUSIC PLAYING]
