LAK LAKSHMANAN:
Welcome to GCP Next.
This morning, on the Twitter feed, there was someone saying, oh no, here are the keynotes.
And we have Disney.
We have Home Depot.
We have HSBC.
Google's going all enterprise.
Where's all the nerd stuff?
Well, here's the nerd stuff.
So welcome to the
nerd stuff, right?
[LAUGHS]
[APPLAUSE]
What we're going
to talk about today
is how to do data science on GCP. Who does data science on GCP?
Data engineers, right?
So we're going to talk about
how a data engineer can
do data science
in GCP and how GCP
makes that easy,
makes it auto-scaled,
makes it auto-awesome.
Machine learning has been in the air all day today, right? So what is machine learning? Here's the popular imagination.
People say, well, you
have lots of data.
You do some really
complex math somehow
and out comes magic, right?
That's basically what people imagine ML to be.
In reality, though, what is ML?
ML is a lot of work.
You're going to spend a
lot of time and effort
collecting the data.
You're going to use
all your ingenuity,
all your knowledge,
all your experience
to organize this data.
And then you're going to
basically bring your insights,
your domain expertise in order
to create a model that somehow
represents what you
know about the domain
that you're working in and
how it relates to the data.
And then you're ready to go.
Once you have this
model in place,
you can now train this model.
And that training is
all very automatic.
And then you end
up in magic, right?
You do end up in magic.
But getting there takes a lot of work.
And what GCP can
help you with is
to make this work much more
tractable, much easier to do.
So what we're
talking about here is
that GCP is a place where data
science meets data engineering.
And I would like to introduce to you my two colleagues: Reza, who's going to play the part of a data engineer, and Alex, who's going to play the part of a data scientist. Myself, I'm Lak, and I'm going to play the role of the CTO, the head of BI, for a company called Alphaplex. We make widgets.
So what's the scenario?
The scenario here is
that we run a factory.
And in our factory
we make widgets.
And of course a factory has a bunch of different machines, and these machines heat up the factory floor. We also have people working in these factories.
So we have to ensure that this
factory is efficiently cooled.
We need to make sure that
we're not spending too
much on cooling the factory.
But if the factory floor gets too hot, then we're going to end up with wear and tear.
So we need to figure
out the optimal way
to cool the factory floor.
Fortunately, we've been
collecting data, right?
We've been collecting
data, and we'll
be able to use
that data to solve
this problem of how to
efficiently cool the factory
floor.
What we have is a factory and an air conditioner, and we need to figure out when to switch on the AC.
If this sounds familiar to you,
this is kind of the same issue that we face in Google data centers
when we need to basically figure
out how to efficiently cool
a data center.
This is a common problem, right?
So we need to figure out
when to switch the AC on.
We have a bunch of
different machines, right?
And these machines essentially heat the factory up and cool it down, right? So we need to basically figure out how to keep these machines, and the people, at safe temperatures.
And what data we have is
the inside temperature
of the factory, the
outside temperature.
And the reason we have
that is because we went
to our chief scientist, Isaac.
And Isaac told us
that he actually
has this formula, right?
So Isaac has this formula that says: if you know the inside temperature and the outside temperature, and you do all of this math, I'll tell you, for a bowl of soup, how long it's going to take that bowl of soup to cool down.
So Isaac says he has
the exact formula.
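For reference, the formula being alluded to is presumably Newton's law of cooling, which says that an object's temperature changes at a rate proportional to the difference between its temperature and the surrounding temperature:

$$\frac{dT}{dt} = -k\,\bigl(T - T_{\text{outside}}\bigr)$$

where $T$ is the object's (or the factory's) temperature, $T_{\text{outside}}$ is the ambient temperature, and $k$ is a constant that depends on the setup.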
But we say, hey, Newton,
hold your horses, right?
We don't want to
basically use that formula
because real life is
a bit more complicated
than just a bowl of soup.
Why is it more complicated?
Because we can monitor
the inside temperature.
We can monitor the
outside temperature.
But there are other
complicating factors--
complicating factors like
which machines are running,
how many machines
are running, right?
What are the machines doing, and how much are they heating things up, right? How many people are in the factory?
And all of these
kinds of things make
applying that
heuristic, that rule
that we had, really hard to do.
And if you think about machine
learning in the real world,
it's usually a replacement for what you used to have.
You used to have formulas.
You used to have rules.
You used to have heuristics.
And now you change those
out and replace them
by something that
learns from the data.
So we have collected
data over time
about the factory's inside temperature,
the outside
temperature, what was
running in the factory at
different points in time.
We have that data.
And so we will basically
show you that journey, right?
But first, let's
start where we are.
And let's say that we know
things about our factory floor.
We know the inside temperature.
We know the outside temperature.
Let's figure out how
to basically balance
the cost of cooling versus
the cost of the repairs
that are going to result if
we let the temperature get
too high.
That's basically the balance
that we have to strike.
So Reza, let's build
something cool.
REZA ROKNI: OK.
So I am the data engineer,
and we're a small company.
And so in terms of the
number of data engineers
we have, just me.
So we have some
constraints in terms
of how much stuff
I am able to do.
And one of the lines that
Lak said very casually
is we've been collecting data.
Now, as a data
engineer, that line
is potentially quite
a lot of work for me.
So let's have a look at the kind of thing
that we needed to put together
to actually collect that data.
So we have this factory.
And again, we have
one data engineer.
I want to make sure that anything that I'm writing is, first of all, production grade and, second, that it will scale out.
So while we might be doing
an experiment right now
with one factory,
I want to make sure
that if we have factories
across the globe,
if we have many
thousands of IoT devices
in that factory, recording
things like temperature, number
of people, maybe tomorrow
it's measuring things
about the machines themselves
and sending that information
on, no matter how
much data I have,
I want to be able to always
absorb this data coming
from all of these IoT devices.
And it's important I
do this in stream mode.
One, it's important for the data collection and the processing, but two, as we'll come to see, it matters when we want to do some inference on those data points.
The next piece of collecting--
after collecting the data
there's a couple of other
things I'm going to need to do.
First of all, I'm going to
need to do some processing.
And again, one engineer, I
want to make sure whatever
I write is very small amounts
of code and the systems
that I'm using
are fully managed.
So I don't have to wake up
at 3 o'clock in the morning
to restart servers or to do
things like re-balance things
because our loads have gone up.
In that processing, I'm going
to do two very simple things.
First, I'm going to
element-wise take
that data to a repository
for the propeller
heads to be able to do
their smart data science.
The other branch of this is my very basic check of when to turn the AC system on.
And that is a simple test: I'm going to run a sliding window across the streaming data that's coming in. That sliding window's going to be five minutes long. For the purposes of this demo, the period of that sliding window's going to be 15 seconds.
So every 15 seconds that
five-minute window's
going to slide down.
I'm going to do a
calculation of a mean.
And I'm going to do a
really simple check, which
is if the value is over 30 degrees centigrade, then turn the AC on. I'm from the UK, so it would be centigrade, right?
So let's think
about the components
that I'm going to use.
First of all, I want to use the Google Cloud Pub/Sub API. It's a fully managed publish/subscribe API. It meets the requirements because it's global, so I can have many factories all sending messages to the API.
I can have a single topic.
So I don't need to worry that today I've got 100 messages a second and tomorrow I have hundreds of thousands of messages a second.
I don't need to reshard.
I don't need to think
about partitioning.
I just keep sending
information into that topic.
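To make that concrete, here's a minimal sketch of what publishing one of those readings into a topic might look like from a device or gateway, using the Python Pub/Sub client. The project ID, topic name, and message fields here are placeholder assumptions, not the ones used in the demo.

```python
# Hypothetical IoT reading published to a Pub/Sub topic.
import json
import time

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Placeholder project and topic names.
topic_path = publisher.topic_path("my-project", "factory-telemetry")

reading = {
    "factory_id": "factory-01",
    "timestamp": time.time(),
    "inside_temp_c": 27.4,
    "outside_temp_c": 19.2,
    "machines_running": 12,
}

# Pub/Sub takes raw bytes; the pipeline downstream parses this JSON.
future = publisher.publish(topic_path, json.dumps(reading).encode("utf-8"))
print("Published message id:", future.result())
```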
The next piece of the puzzle
is the processing side.
So I'm going to take this data
that's coming from Pub/Sub,
and I want to process it.
One of the other advantages that pops up is that it is HA, and it will hold onto all information sent to it for up to seven days, until a subscriber pulls the message and acknowledges it, at which point the message is removed.
So the component that I'm going
to connect to this is Dataflow.
And Dataflow allows us to
do both batch and stream
processing in our pipeline.
In this case, we're just
going to use the streaming
capabilities.
Dataflow will take care of spinning up machines, auto-healing, and all that other stuff that, again, I don't want to be concerned with.
I just want to write
simple declarative code
on what my pipeline
should do and then
let the system get on with it.
In that pipeline, I'm
going to have two branches.
The first branch is to go to
BigQuery, which is where we're
going to store our information.
There are multiple choices of
where you could land IoT data.
Especially as it's
time series, Bigtable
could be another option.
However, since we just want to run queries against this data, BigQuery's a very good place to land it.
The other branch is the simple check that I want to do: take the mean of the sliding window, and if it's over 30 degrees, then tell the AC to either come on or off.
And in this case,
we're actually going
to use PubSub again, although we
haven't got it in this diagram.
The reason we're
using PubSub now
is not as an ingestion mechanism, but as the glue within my data pipeline.
So when I have a message,
I put it back into PubSub.
It can communicate
with downstream systems
in the factories to be
able to turn this AC on.
And just to see how this
would look on our pipeline
as it's running, could you
connect to the laptop, please?
Thank you.
So this is the monitoring
UI from Dataflow.
And it's actually showing
the pipeline in motion.
We've got the IoT devices
pushing data into a topic
that I am reading from.
I am then parsing the IoT data.
So it's coming in JSON format.
I'm extracting the
bits that I need.
And then it branches out.
The first branch is
that element-wise,
it's sending data to BigQuery.
And by element-wise
I mean there's
no micro batching or batching.
So as data's flowing
in, it's immediately
available to any queries
the guys want to run.
On the other side, I am
doing our sliding window.
So I create a five-minute
sliding window.
I will then extract
the temperature
from the data that exists
within that five-minute window,
pass it to a mean function.
From there, we will
check if the value
is over 30 degrees centigrade.
If it is, I'm going to
send it to two locations.
The first location,
again, is that PubSub,
so that it actually can talk
to the factories downstream.
The other location is
actually to Firebase.
And the reason I'm
sending it to Firebase?
Well, I'm going
to show you here.
So this is that basic test.
It's that information being
sent onto the Firebase engine.
And this is essentially to help us visualize what's going on.
So as the data gets updated,
the fact that it's gone yellow
means that it was
the latest update
from Dataflow pushing to it.
Now I can actually
visualize this in things
like graphs that connect
to the Firebase engine.
And as you'll notice, you can actually tell when my laptop was down, because we've got these flat lines where it didn't have any data at that point, and it just jumped across.
So this system will scale.
Could we go back to
the slides, please?
So the amount of
code I had to write--
this is the code for
the Dataflow section.
For Pub/Sub, I just
created a topic.
There's nothing other than
a DevOps action there.
In terms of the code,
these are snippets.
But just to give you a flavor,
in Dataflow, data that comes in
gets turned into a
parallel collection
of immutable objects.
And what we do with
that collection
is we run a series of
transforms against the data.
The first transform I do is
saying I want a sliding window.
So here, those three
lines at the top--
with window into SlidingWindow,
duration 5 minutes,
with a period of 15 seconds.
Of course, if we were
doing this for real,
we wouldn't use 15 seconds.
We don't want the AC to come
on and off that quickly.
You'd put it at five
minutes, but then you
wouldn't see the nice flashing
indication on the Firebase.
Next, I want to actually do
the calculation of the mean.
Mean, sum, counts, all of
these are standard things
that people want to do
with data in pipelines.
So there's a primitive
built into Dataflow.
That single line of code
allows me to get the mean
from every single window.
Next, we do a very trivial test, the check threshold. And that check threshold is nothing other than returning a Boolean if the value of the mean is over 30 degrees centigrade.
So that's our very
basic pipeline.
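The snippets on the slide are from the actual pipeline; here is a rough, hedged equivalent of that basic streaming pipeline sketched with the Apache Beam Python SDK. The topic names, the JSON field being parsed, and the message format are assumptions for illustration, not the demo's exact code.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window


def extract_temp(msg_bytes):
    """Parse the incoming JSON reading and pull out the inside temperature."""
    return float(json.loads(msg_bytes.decode("utf-8"))["inside_temp_c"])


options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    temps = (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/factory-telemetry")
        | "ExtractTemp" >> beam.Map(extract_temp)
    )

    # (The other branch, writing the raw elements to BigQuery, is omitted here.)

    (
        temps
        # Five-minute sliding window, evaluated every 15 seconds.
        | "SlidingWindow" >> beam.WindowInto(
            window.SlidingWindows(size=5 * 60, period=15))
        # Built-in mean combiner over each window.
        | "MeanTemp" >> beam.CombineGlobally(
            beam.combiners.MeanCombineFn()).without_defaults()
        # Trivial threshold check: fan on if the mean is above 30 C.
        | "CheckThreshold" >> beam.Map(
            lambda mean_temp: json.dumps({"fan_on": mean_temp > 30.0}).encode("utf-8"))
        | "PublishDecision" >> beam.io.WriteToPubSub(
            topic="projects/my-project/topics/ac-control")
    )
```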
Now we're going to hand it
over to our very smart data
scientists, who are going to
do some magic with that data.
ALEX OSTERLOH: So I didn't
get the mobile microphone,
so I have to stand
here, which is great.
Because otherwise, I run off.
So from a data science point
of view, what we just saw was
we have a rule in
place for deciding
when to turn the AC on and off.
And that rule was,
is the temperature
bigger than 30 or
less than 30, right?
So it's like that at Google: with many of our products, we've been making these rule-based decisions a lot.
And only in the last couple of years have we turned to machine learning to actually improve
the way our products work,
make predictions, and increase
the quality.
So one example: if you do a search on giants, it probably depends on where you are. If you're here in San Francisco, you might be interested in the San Francisco Giants.
If you're in New
York, it might be
different types of Giants,
football team versus baseball
team.
So we used to write these
rules where you say, OK, here's
the query.
Then based on the location, if
it's San Francisco or New York,
we give the user
different search results
that are focused on what the
user is probably desiring.
In my case, I'm from Germany.
So if I search for
giants, it might
be the Dortmund Giants, which
I didn't even know existed.
But I was searching for
a team, and, of course,
there's a German Giants team.
So RankBrain is a good example of how we switched from using rules to machine learning.
Because instead of just looking
at the search terms themselves
and how often they
appear in a text
and how close they
are to each other,
we actually use natural
language processing to determine
the relevance of a document.
And this is the third most important ranking criterion in how we determine whether a search result goes further up or further down.
And this is the
biggest improvement
we've seen in our search quality
in the last couple of years.
So if we turn to our problem,
we want to cool a factory.
We have this temperature
coming in of 27 degrees.
And then we're
trying to decide how
do we react if the outside
temperature's higher or lower.
Then we have a different
number of machines running.
And then you have
new factors, right?
So, now, people give
off heat as well.
So you might want
to consider that.
But maybe there are different factories with different sizes, with different numbers of air conditioning units that you can set to different levels.
These are all very
different criteria.
So you can see,
turning on or off
the AC, based on these
rules, can become very, very
hard to do or impossible to do.
So machine learning has been
around for, it seems like,
hundreds of years.
So Isaac is thinking now, could we maybe use machine learning for this instead?
So just a little 101 on machine
learning and, in this case,
deep learning, which I'll show
you in a couple of minutes
and how it works.
Basically, it's supervised learning. We use deep learning here. This is one method of machine learning that works really well for the stuff we do at Google. And it works by taking a lot of labeled data, where you know this is a cat, a dog, whatever it is.
You train a model
in the first step.
Then you evaluate how
well is this doing.
So you have training data set.
You have test data set.
You find out the
quality, and you
do this in an iterative
approach until you have
the quality that you desire.
And then what you can do
is you can take that model,
put it on a phone, build a cool
app to take a picture of a cat
and decide, is this
a cat or not, right?
So this would be applying the model.
So we're going to also use
this for our problem here.
So, because this is all live,
we're training a model
based on all this data
that is already being
captured in BigQuery, right?
So you have number of
machines, inside temperature,
maybe number of people, or
other factors that you can use.
And then you can
apply this model
to actually make a decision
to turn on or off the AC.
Whenever you do
machine learning,
you're trying to optimize
for something, right?
So in this case, we're
trying to minimize the cost.
And cost in our case is the
cost of the wear and tear
of the machines, right?
Because we're saying we
don't want the temperature
too high because it is
bad for the machines.
They need to run at a
certain room temperature.
And we want to minimize
the cost of maintenance
of buying new machines, right?
So this is just checking that
the temperature's not too high.
And on the other side, we
don't want the AC running 24/7
because that's going
to cost energy,
and we don't want to do that.
We want to be efficient in how
we use that energy to power
the AC, right?
So we're defining a
cost factor, which
is a combination of
those two factors.
And we're trying to
balance this out.
Turns out Google has
its own factory floors,
right, our data centers.
And we're actually applying
machine learning here
to optimize how we
run certain workloads.
You can imagine a lot of the workloads that we run are actually around machine learning, right?
So we're updating models and
improving our recommendations
on cat videos and
stuff like this.
And it doesn't matter if this is
running in Oregon or in Finland
or anywhere else.
So machine learning is a good way to actually make that decision for us. And we've been able to reduce energy costs by up to 40% when we apply machine learning.
And there's actually a
cool white paper out.
There's a URL on there where
it talks about how we do this.
And these are some
of the features used
to actually create a model
that is optimizing our own data
centers.
So if you remember
our architecture
that we had just a
couple of minutes ago,
we're taking all that data.
We're pushing it into BigQuery.
Now we want to take that
data and train a model.
So you might have
heard of TensorFlow.
It's great because I
can just download it.
It's open source, and
I can do this here
on my laptop, which I'm
actually going to do.
The cool thing is that when I'm happy with what I'm getting on the small data set that I'm testing with, I can actually flip a switch and say, now I want to train on a billion data points with all the factories on, right?
So doing that switch
is fairly easy--
training with a smaller
data set locally,
but then going full
production and not
needing to be scared if the data set gets too big or the number of features gets too big, right?
So to demonstrate
this, I'm going
to be using something
called Cloud Datalab, which is a Jupyter notebook hosted by Google Cloud.
And I will be going into
that now, if that's possible.
Can we switch to
the demos again?
Thank you.
So this is Cloud Datalab.
This is a notebook.
It's pretty cool because I can
share this with colleagues.
And we can try to find a cool
way to work with this data,
right?
So this is cool for
experimentation.
I can run Python code in here.
I can run TensorFlow.
It's one of the easiest
ways to run TensorFlow.
But I can also pull in a
lot of the Python libraries
that are out there,
like pandas, that we're
going to use here as well.
So I'm not going to
run everything here.
Obviously, you need to do some setup, right?
We need to define the project
and the Cloud Storage bucket
and these type of things.
We need to import
some libraries,
define a schema that is
representing the schema that we
have in BigQuery.
Here we are actually going to
pull the data using BigQuery.
So you can actually
see the query in here.
And in that query, we are
also defining the cost.
So we're taking a factor called unsafe, which is the time that the machines ran at an unsafe temperature, above 30 degrees.
And we also have a factor of how
much energy was used, the power
that we used on the AC.
So we have a cost.
We have the temperature,
when the fan was turned on,
the number of machines, and
the outside temperature.
So we're pulling that
into a Pandas data frame.
If you remember, we have
all that data to train on.
Now we're going to split the
data into a training data
set and data set to
evaluate our training.
So this is happening.
So we're using 90-10 split of
training data and testing data.
And this is the output of some
of the rows right in here.
So you see the cost, the fan
on, the number of machines.
Outside temperature.
That's pretty cool.
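To give a feel for what such a notebook cell might look like, here's a hedged sketch of pulling the data out of BigQuery into pandas and doing the 90/10 split. The table path, column names, and cost formula are placeholder assumptions, not the exact schema and query from the demo.

```python
# Sketch of a Datalab cell: query BigQuery into pandas, then split 90/10.
import numpy as np
from google.cloud import bigquery

sql = """
SELECT
  inside_temp,
  outside_temp,
  num_machines,
  fan_on,
  -- Hypothetical cost: time above the safe temperature plus AC energy use.
  unsafe_minutes * repair_cost_per_minute + ac_energy_kwh * price_per_kwh AS cost
FROM `my-project.factory.sensor_history`
"""

df = bigquery.Client().query(sql).to_dataframe()

# 90/10 split into training and evaluation sets.
mask = np.random.rand(len(df)) < 0.9
train_df, eval_df = df[mask], df[~mask]
print(len(train_df), "training rows,", len(eval_df), "evaluation rows")
```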
So I'm going to go a little
further down to the part where
we're actually going to train
based on all that data, right?
So we're pulling all
the data from BigQuery,
and we're going to train a model.
This is using TensorFlow.
It has two hidden layers.
It says it right here.
It says the number
of iterations.
So we're going to go through.
And we have two hidden
layers, one with 64 neurons
and one with 4 neurons.
So I'm actually
going to run that,
and hopefully
something will happen.
So this should just take
not more than seven seconds.
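As a rough illustration of a model like the one described, with two hidden layers of 64 and 4 neurons predicting cost from the inputs, here's a minimal Keras-style TensorFlow sketch. It assumes the placeholder train_df and eval_df frames from the earlier sketch; the demo itself used the TensorFlow notebook code of the day, so this is not the exact code that ran.

```python
import tensorflow as tf

# Placeholder feature columns, matching the hypothetical schema above.
FEATURES = ["inside_temp", "outside_temp", "num_machines", "fan_on"]

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(len(FEATURES),)),
    tf.keras.layers.Dense(4, activation="relu"),
    tf.keras.layers.Dense(1),  # predicted cost
])
model.compile(optimizer="adam", loss="mse")

model.fit(train_df[FEATURES].values, train_df["cost"].values,
          epochs=50, batch_size=64, verbose=0)

# Predictions on the held-out 10%, used for evaluation below.
predicted = model.predict(eval_df[FEATURES].values).flatten()
```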
So it says local training done.
So that means it's
created a model,
and it's putting
it in a directory
that we specified, right?
So now we have a model in a directory.
And now we can do some
prediction on that data, right?
So we can give it some
temperatures, some inside,
outside, number of
machines running.
And it's going to give us
back a cost factor, right?
So here comes the
important part,
where we actually now determine
and evaluate how well did
our model actually do, right?
So remember, we had
the training data.
We have the test data.
And now we want to compare the
two different data points--
how well did our predictions
actually do against some test
data that we held back and
that wasn't used for training?
So this is actually the
table we get right here.
So we have the true
cost that we know about,
and we have predicted cost.
Since this may be a little hard to look at and understand, there's also a visualization.
So that's one of the cool things, also, with Jupyter notebooks: you can use all these visualization libraries that are there for Python.
And you get kind of this heat map where, based on a certain true cost factor, you see the spectrum of what was predicted, right?
So here you have the true cost.
Here you have the
predicted cost.
And this is a pretty good
visualization, giving you
an idea of how
well are we doing.
And maybe this is an indication
for you to go back up
and maybe do some parameter
tuning and try some things out,
do a different split
maybe on training and test
and these type of
things, until you're
at a quality that is good
enough for your use case.
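One hedged way to produce that kind of true-versus-predicted heat map in a notebook, assuming the placeholder eval_df and predicted values from the sketches above, is with matplotlib's hexbin:

```python
# Quick true-vs-predicted density plot on the held-out data.
import matplotlib.pyplot as plt

plt.hexbin(eval_df["cost"].values, predicted, gridsize=30, cmap="viridis")
plt.xlabel("True cost")
plt.ylabel("Predicted cost")
plt.colorbar(label="Number of examples")
plt.show()
```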
Now I'm at the point where
we actually have a model.
And what we're not doing here
is we're not training this
again on Cloud ML
on a billion data
points because we don't
have the time for you all
to see how long this takes.
So we are going to keep
the model as we have it.
And we're going to push it to Cloud ML, which
means I can give
Reza, in a second,
an API call that he can use
to ping the Cloud ML API
and give it some values.
And these are the values that are going to come in through PubSub and Dataflow, so he can give it some values and make a prediction on what action should be taken.
So what you could do, actually,
is you give it two numbers.
You give it maybe the current
temperature indications.
And then you can say, OK,
the way it's going right now,
the way the temperature's
moving right now,
what if the temperature's
one degree higher or two
degrees higher, what
would my actual cost be?
So you get two values back.
And that could be an
indication for you
to take action now
or wait two minutes
because the cost,
actually, of turning
the AC on in two minutes is
going to be actually lower.
So this is the type of back and forth that I can now give to Reza, which he can hopefully use, without having to tear down the pipeline he built, in the architecture that he'll be demonstrating in a second.
Great.
So I think we go back
to slides, right?
REZA ROKNI: Yeah, thanks.
So the key component
that I'm actually
going to have to swap out
is that check that I was
doing against the temperature.
So before, if we
remember our pipeline,
we had IoT devices all sending
streams of data to PubSub.
PubSub was being
consumed by Dataflow.
We were splitting
the data, sending it
to BigQuery, which allowed
Alex to be able to do his work.
On the other side, we
had that simple test,
where I was doing a
sliding window of five
minutes, every 15 seconds,
calculating the mean,
and doing a check.
Now I just need to replace
that with the inference
that we have now.
And I also no longer
need that sliding window.
So that sliding window was just
there for me to get an average,
rather than go point by point.
Here, the system can just
take each point by the values
that we have and then
give us a prediction back.
So the one thing
I'll need to do is
change from that check
threshold, where I was just
checking the temp value
against 30 degrees,
to the new code, which is
to call the Inference API.
And the Inference
endpoint is the one
that has been deployed
with that version number.
We have the shape of the packet that I need to send to it. That packet is a JSON structure I'll need to put together.
That JSON structure
comes from the same data
that I was collecting before.
So in order to actually
switch this out,
I will just need to replace my threshold check with a REST call to the Cloud ML service.
And then from there,
the Cloud ML service
is going to give me
a prediction back.
That prediction back
will be in JSON format.
I need to just extract
the piece of data
that says whether the fan
should come on or off.
And then through the same
part as the pipeline,
I will actually send
it on to the factory.
And it will decide whether
to issue the on or off.
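As a sketch of what that swapped-in call might look like from Python, here's a hedged example of hitting the Cloud ML Engine online prediction endpoint with the Google API client. The project, model, version, and instance fields are placeholder assumptions, and the shape of the returned prediction depends on how the model was exported.

```python
# Hypothetical replacement for the threshold check: call the deployed model.
from googleapiclient import discovery

ml = discovery.build("ml", "v1")
# Placeholder project, model, and version names.
MODEL_NAME = "projects/my-project/models/factory_cooling/versions/v1"


def predict_fan_on(inside_temp, outside_temp, num_machines):
    """Send one reading to the prediction service and return its output."""
    body = {"instances": [{
        "inside_temp": inside_temp,
        "outside_temp": outside_temp,
        "num_machines": num_machines,
    }]}
    response = ml.projects().predict(name=MODEL_NAME, body=body).execute()
    # Extract the piece of the JSON response that drives the on/off decision;
    # the exact key depends on the exported model's signature.
    return response["predictions"][0]
```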
So if we could just go back
to this laptop, please.
So previously, the section
that we hadn't shown
was this middle bit, which
was taking that data.
And instead of doing the five-minute sliding window, it's actually calling the prediction service.
We get the output from
the prediction service.
I have a simple
JSON deserialization
to take the values that
I want out from that.
And then I send it again on
to both PubSub and Firebase.
If we go to Firebase, I'll
actually go up a level now.
We can see two data points.
One was the original flow, which has got the check of whether it's over 30 degrees.
The other one is
now the inference
being called from
the Cloud ML service.
And the reason,
actually, we'd want
to keep both running is,
like any good data engineer,
I'm not going to just
switch over from production
in a second, right?
This is all good stuff.
I trust Alex has
done a great job.
But I want to make sure
everything continues working
before I make that switch.
And this is where we start
doing the AB testing, right?
In this case, we're AB testing
a very old, basic method
of checking when the fan should
come on with the new inference
model.
But as time moves
on, let's say this
is all very good, successful,
and I actually make the switch,
we're going to have
different models being built.
So maybe Alex decides that
actually given the data set,
every week we want
to rebuild the model.
Maybe it's every day.
Whatever it is, the
Cloud ML service
will allow you to create
different versions of the model
as you do that.
So you have V1, V2, V3, V4.
This allows me to
continue to make sure
that I can do AB testing
of that new model.
So if it's happening
over the day
and there's a new
model landing, I
can maybe decide that what I'll
always do as a matter of course
is 30% of the workload
will go into the new model.
70% will continue the old model.
Once I'm happy everything's in
place, I can make the switch.
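A minimal sketch of one way to do that kind of 30/70 split, assuming placeholder version names and that the choice is made per request inside the code that calls the prediction service:

```python
# Route roughly 30% of prediction requests to the new model version.
import random

OLD_VERSION = "projects/my-project/models/factory_cooling/versions/v1"
NEW_VERSION = "projects/my-project/models/factory_cooling/versions/v2"


def pick_version():
    """Return the model version to call for this request."""
    return NEW_VERSION if random.random() < 0.30 else OLD_VERSION
```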
Dataflow allows you
to do in-situ updates.
So you can actually update the pipeline as it's running.
It will take care of
redistributing the code.
And at that point,
switch onto the new model
until another model comes
along and I switch it back out.
And these are the kinds of things that allow us to take the data science and actually make it production-ready and usable very, very efficiently.
So back over to Lak.
LAK LAKSHMANAN: Thanks.
Could we get back
to the demo machine?
REZA ROKNI: Do you
want the slides?
LAK LAKSHMANAN: Yeah,
I want the demo.
REZA ROKNI: The demo machine?
LAK LAKSHMANAN: Yes, the
demo machine, please.
Right, so just to recap: in order to do something like this, and to make it easy, notice that the first thing Reza did was make sure that he took all the data.
Before he did all
the sliding windows,
before he did all
the aggregations,
there is this nice little thing
that says Write Raw Data to BQ.
That's where everything starts.
If you're not storing data,
you're not going anywhere.
You're not getting to anyplace that's auto-awesome.
The first thing
is save your data.
And once you have saved the data, that is the data that Alex could basically use in his Datalab. And you'll notice that his Datalab notebook started with BigQuery. He went into BigQuery and said, let me go ahead and pull my data, right -- bq.Query, right?
That's essentially
the basic first step
that you need to be able to do.
You need to be saving
the data in order
to go back and replace
the kinds of systems
that you are doing now
with something better.
I cannot emphasize this enough.
Make sure to save your data.
Notice some of the other
things that happened, right?
As Alex built the Datalab
model and he deployed it,
it was a very simple
Cloud ML deploy.
It went onto the cloud.
And he essentially got
something that could
be called with a REST API.
And Reza knew that he didn't have to worry about whether this model that Alex had built -- built by a data scientist -- was actually going to work in production, because it's basically running in the Cloud ML service.
And Reza could simply
invoke a REST API call
from his Dataflow code.
Now, these are the kinds
of engineering improvements
and guarantees that enable you
to do good data engineering
and data science on GCP, right?
And Reza went ahead and
pointed out a few other things,
like the ability
to do AB testing,
to be able to have both the
old service and the new service
running at the same time, so
you basically build confidence
in your business users before
you actually make the switch.
Could we get back to
the slides, please?
So bottom line, then, is that
this is now an amazing time
to be a data engineer
and a data scientist.
And the reason it's
an amazing time
to be a data engineer
or a data scientist
is that the amount of
work that you need to do
has gone dramatically down.
If you're thinking
about building machine
learning models,
the amount of work
that you need to do to build
a machine learning model
fits in a single
Datalab notebook.
And if you looked at the
code for that single Datalab
notebook, as a first start, it
was all like structured data,
go ahead and preprocess,
structured data,
go ahead and train.
It was single calls
on the Python API,
a very easy way to get started
doing machine learning,
starting with structured data.
And it's a great time to be a data engineer because, again,
you can write Dataflow code
and have all the DevOps
and have all the auto-scaling
and all of these things
be taken care of by
the cloud, right?
So what exactly
does auto-awesome
mean to you as a data engineer
or as a data scientist?
As a data engineer,
I think what it
means is that whether
you're doing ingestion--
notice that Reza started out ingesting data with Cloud Pub/Sub.
And it could scale
from tens of messages
to millions of messages,
from local to global.
And he didn't have to
lift a finger, right?
He just set up a
topic, and that was it.
And then anybody could
basically post into that topic.
The ingestion was auto-awesome.
Transformation, how did
Reza do the transformation?
He used Apache Beam and Cloud Dataflow.
He wrote his code
in Apache Beam,
which is totally open source.
He executed it using
Cloud Dataflow on GCP.
And that essentially, again,
enabled the transformation code
to be automatic, right?
And we talked about how you can
basically deploy a new service,
and you can replace
a running pipeline.
How amazing is
that, right, to be
able to replace a
running pipeline,
to not lose any messages
in the process of replacing
a running pipeline?
That's the key thing, right?
When we say you can replace a
running pipeline, what we're
saying is your old
pipeline processes
until a certain message.
And the new pipeline takes over at the next message. You've not actually lost any messages in the process, right? You can replace a running pipeline without missing a message.
And the other thing, we kind of
glossed over it a little bit,
but as the Dataflow pipeline
was writing things out,
it was writing
them into BigQuery.
And we could be querying the data, even as it was arriving.
You saw the Firebase app.
The Firebase app was making
a query into BigQuery,
even as the data
were streaming in.
Again, we made it
look very easy, right?
But this is extremely,
extremely hard.
Think about being able to do
querying on streaming data,
right?
And we're doing SQL
queries on streaming data
to power that Firebase
graphical thing
and, finally, to be
able to do your training
on your machine learning model.
And the good thing
is that all of this
will auto-scale to thousands
of machines on demand.
All we are writing is
we are writing code.
We're writing code.
We are defining the logic.
And everything else
is just auto-scaling.
The other thing that we
took a huge advantage of
was that when Alex
did his training,
he was training
it on batch data.
But then when we
did the predictions,
we were predicting
on streaming data.
And it was actually a
very easy transition.
And the only reason that it
was a very easy transition
was that Reza was
using Dataflow.
And Dataflow is a
programming model
that lets you deal with
batch data and streaming data
in exactly the same way.
So we could train a
model on historical data,
which is what you always do.
But you predict on
newly arriving data,
which is what you always want.
And we didn't get into this,
but the model that Alex built,
that was his first model.
And over time we'll need to
do hyper-parameter tuning.
We'll need to tune the models.
And all of that
is also automatic.
All of these things about tuning a model and making it better can also be part of your whole process, completely automated.
So bottom line, then, what
GCP offers to data scientists,
to data engineers
is that we offer
a way to fuel your innovation.
You write your innovation.
We give you the rocket ship.
We give you the rocket ship
to fuel your innovation.
Whether you're talking
about ingestion with PubSub,
you're talking about
transformation with Dataflow,
you're talking about analyzing data with BigQuery,
doing your experimentation
with Datalab,
you're doing machine
learning with Cloud ML,
it all just works.
And that's ultimately the thing: it's all very well integrated. And it enables a huge amount of innovation.
So a quick shout out
to related sessions--
today at 2:40 PM there is a session on IoT solutions on Google Cloud. And there is another talk on Apache Beam, on how to do portable and parallel data processing.
So both of those
are very related
to this session
in terms of being
able to do stream processing.
And if you looked at
the session and said,
hey, these are all
the kinds of things
that I do these days, I
strongly encourage you to go
visit the certification launch.
The Data Engineer
certification on Google Cloud
is currently in beta
so it's discounted.
So please go ahead
and give it a try.
You can take the exam
right here at Next.
And thirdly, the
shameless plug, there
is my book, "Data
Science on GCP,"
which basically
goes from ingestion
all the way to machine learning,
the same kind of thing
that we talked about here, but
using a use case that's around
predicting flight delays--
how to ingest data and
then go all the way
to doing streaming
real-time predictions,
using time windows, et cetera.
So I step you through
this entire process.
The book is an early release.
So you can go to the
O'Reilly website.
You can start
reading the book now.
So thank you all very much.
[APPLAUSE]
