MIKHAIL CHRESTKHA: I
think everyone in the room
will agree that R is a
very active community.
It has some of the best data scientists, statisticians, and mathematicians out there.
It's also had, over the last two decades, very healthy and steady growth. But year after year I hear, can R scale? Can it deal with very large data?
So that's what we're here to do today: to talk to you about how you can scale your R workloads on Google Cloud.
Our agenda for today.
We'll be talking
about the evolving R
ecosystem on Google Cloud.
Three reference workflows for training and serving machine learning models: one, TensorFlow; two, Spark; and three, any R library that might be a favorite of yours.
Finally, we'll close with a Q&A using an online Dory app.
What is a Dory?
Well, it's a Q&A messaging app
that we use internally for Q&A
at scale to make sure
everyone has an opportunity
to ask questions.
If you go to our Google Cloud
Next app, find our session,
you'll see a Dory Q&A link.
This will allow you
to compose a question
as well as vote to move the
good questions up to the top
that you want to see answered.
This will actually be
open for the entire week.
So if we don't get to
all your questions,
we'll make sure to answer them
directly in the app itself.
So let's get started with the
R ecosystem on Google Cloud.
I posed the question of whether R can scale. One response might be, does it really matter? Is there enough demand for it?
So two areas to look at are distributed data processing, as well as deep learning. Two workloads that require lots and lots of computation.
And if I do a quick analysis on CRAN logs, the number of CRAN downloads from 2017 Q4 versus 2018 Q4, the TensorFlow and Keras libraries, two deep learning frameworks, have doubled in usage.
This really is a signal that data scientists are exploring deep learning to solve novel problems in new domains.
Also, sparklyr has actually grown by 6x during that same time period, a great signal that data scientists are really pushing the limits of one machine, and the memory it restricts you to, and looking for distributed options on clusters.
So, can Google Cloud help R scale? I've been at Google Cloud for two years, and I will say I felt neglected as an R user.
Of course, as a cloud
provider, developers
were our primary audience.
We want to make sure
that developers can build
applications on the cloud.
We provided client libraries in Java, Node.js, Golang.
Next, we needed to help data
engineers build these pipelines
and data stores, providing
a SQL interface, maybe
a Java API for big data
frameworks like Apache Beam.
And then finally,
TensorFlow is still
central to our ML ecosystem.
So naturally, Python was the
first class language choice
for ML engineers.
But after two years
of feeling neglected,
I really am happy
to talk about some
of the evolution of Google Cloud
to support the R community.
Natively, we do have a lot of notebook options now. We also give you options to run Spark on a managed, auto-scaling Spark service, and allow you to train TensorFlow models or any custom R library models.
And really, a big thanks
to the overall community.
There's a lot of packages that have been contributed by Mark Edmondson, Hadley Wickham, and the RStudio team to interface with a lot of the great big data and ML services that Google Cloud has to offer.
So let's talk about what this actually looks like. At the very top, we have lots of different notebook solutions: Zeppelin; everyone's favorite, the RStudio IDE; as well as a managed Jupyter notebook with an R kernel.
Then moving to the
very bottom left,
we have ingestion
and storage options.
The two main ones we'll cover today are Google Cloud Storage, our object storage service, really for unstructured data: working with images, free-text files, audio samples. And then for structured and semi-structured data, we have BigQuery, our petabyte-scale, fully serverless data warehouse.
As we move more into exploratory
work and data profiling,
what are our options to
explore and prepare that data?
We have a SQL syntax
on top of BigQuery
that allows you to crunch
through billions of rows.
We also have Spark, again, the managed, auto-scaling cluster that I talked about, if you want to use some of the more statistical techniques like PCA or other dimension reduction techniques that are available through Spark's native MLlib library.
Now, this is where
things get interesting.
Usually, I would move
on to the right side
and build a custom ML pipeline.
But I do want to stop and think about the bottom middle there, where, before you invest time in that, you should actually look into whether some of our pre-trained APIs can offer up solutions that get you a good way there.
We have a lot of
different pre-trained APIs
in the natural language space,
the image space, translation,
audio transcriptions.
And we actually have an ML service built directly into our data warehouse, BigQuery.
And on the right side, this
is where we have invested
and we believe we can add
value in building a robust ML
pipeline.
And the three workflows
we're going to cover today
are TensorFlow and Keras
on a managed service
called Cloud Machine
Learning Engine.
We'll have Spark's MLlib library, beyond just the data processing techniques, on Dataproc.
And then the number one question I've gotten over the last year: what about any R library? I like to use my statistics libraries, my caret, my rpart. How can we use a container-based approach to actually run those at scale as well?
So before I jump into the workflows, which we'll primarily focus on, I also want to give you two quick examples of how to embed some of these workflows upstream.
So let's say I want to
query lots and lots of data.
On the left side, I'm
loading bigrquery.
This is created
by Hadley Wickham.
And with a few lines of code, nothing special, your normal SQL syntax, this exact query actually crunched through 100 billion rows, 3 and 1/2 terabytes, in 45 seconds. And then it downloads the result as a data frame for you to then run ggplot and various other analytics on top.
So very powerful for
quick exploratory work.
And really allows you to
push all that scale work
to our managed service,
and then bring what you
need into your R data frame.
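For reference, here's a minimal sketch of that bigrquery flow; the project ID and this public-data query are placeholders, not the exact query from the talk.

```r
# Minimal bigrquery sketch: BigQuery does the heavy scanning,
# and only the small result set comes back as an R data frame.
# "my-gcp-project" and this public-data query are illustrative.
library(bigrquery)
library(ggplot2)

billing <- "my-gcp-project"

sql <- "
  SELECT year, SUM(number) AS births
  FROM `bigquery-public-data.usa_names.usa_1910_current`
  GROUP BY year
  ORDER BY year
"

tb <- bq_project_query(billing, sql)  # query runs inside BigQuery
df <- bq_table_download(tb)           # small result lands in R

ggplot(df, aes(year, births)) + geom_col()
```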
On the other side of things, let's say I work in customer service, and I want to analyze all the customer logs for sentiment and for what our customers are actually calling about. Usually, I would probably need to go to school and get a deep learning degree. I would then maybe have to look for a word embedding and understand the relationships between the different words, not to mention creating the labeled data set that is required for machine learning.
So before I invest in all
that or hire a team member who
can do that, here's
a few lines of code
where you can access our
Natural Language API.
The googleLanguageR package, contributed by Mark Edmondson, allows you, with a few lines of code, to analyze the text: extract various entities ranked by salience, or importance; classify the text into various categories; as well as look at sentiment.
Now, this might not be your end-all be-all. This might be an upstream data enrichment exercise
where you then combine some
of these outputs with some
of your structured data to build
a more traditional structured
data model.
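A hedged sketch of that call with googleLanguageR, assuming service-account credentials (for example, via the GL_AUTH environment variable); the sample text is made up.

```r
# Hedged googleLanguageR sketch; assumes service-account
# credentials (e.g. via the GL_AUTH environment variable).
# The sample text is made up.
library(googleLanguageR)

log_text <- "The checkout page crashed twice and support never called back."

nlp <- gl_nlp(log_text)

nlp$documentSentiment  # overall sentiment score and magnitude
nlp$entities           # entities ranked by salience
nlp$classifyText       # content categories (longer text works best)
```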
So now let's dive into
the actual workflows,
and the first one is TensorFlow.
The RStudio team has been working alongside our TensorFlow team to build a great set of R API interfaces on top of the TensorFlow Python modules.
On the very left side, we have
our three core R packages.
Starting from the bottom,
we have TensorFlow.
This is the lowest
level API that
allows you to build the graph of
operations, gives you the most
flexibility and customization.
Moving on to the middle, tfestimators. This allows you to use out-of-the-box algorithms--regression, classification, some of your common neural net architectures--as well as even moving into spaces like support vector machines.
And, finally, we have Keras, our top-level, high-level API that's really built for usability, quick prototyping, and rapid experimentation.
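For a flavor of that usability, here's a minimal Keras-in-R sketch; the architecture is arbitrary, purely to show the API.

```r
# Minimal Keras-in-R sketch; the architecture is arbitrary,
# purely to show the high-level, pipe-friendly API.
library(keras)

model <- keras_model_sequential() %>%
  layer_dense(units = 128, activation = "relu", input_shape = c(784)) %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 10, activation = "softmax")

model %>% compile(
  optimizer = "adam",
  loss = "categorical_crossentropy",
  metrics = "accuracy"
)

summary(model)
```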
And then on the
right side, we have
a lot of complementary
supporting libraries.
We have tfdatasets.
A lot of times, when you're building a machine learning model, the training process isn't your bottleneck. It might be how quickly you're feeding in all the data in the appropriate file formats. So tfdatasets allows you to work with different file types and build parallel data pipelines.
Next, tfruns: tracking, visualizing, and monitoring those loss curves, and looking at evaluation metrics.
For those of you who have
explored TensorFlow on Python,
this is very analogous
to the TensorBoard tool
directly in your R UI.
Third is tfdeploy. While you'll use R to build and train your TensorFlow models, one of the main powers of TensorFlow is that it is a language-agnostic framework, where any downstream application, whether it be Java or C#, can consume all those predictions. So tfdeploy allows you to export, manage, and save all those models moving forward.
And then finally, our core library, cloudml, is an interface to Cloud Machine Learning Engine, a managed service that allows you to provision a cluster of machines with GPUs or CPUs, as well as host the models as Cloud endpoints for you to consume for instance-level or batch-level prediction.
So what does this look like in
a more holistic architecture?
We do have RStudio Server
Pro available with one click
deployment through
our GCP marketplace.
The code can be version
controlled within a Cloud
Source repository.
This is a private git repo, but you're welcome to use GitHub or Bitbucket as well.
Once you package
your TensorFlow up--
and then let's say we're working
on a deep learning model--
we would store all our images
in the bottom left there
with Google Cloud Storage.
Then we'll package
the TensorFlow code.
And the main piece here, again, is you might be doing development work in the environment your RStudio notebook is based on, but you're going to actually submit the ML training as a job to Cloud Machine Learning Engine.
You'll be able to provision resources for only a number of minutes or hours, with the appropriate hardware accelerators that you need at that time. So it's a very job-focused mindset, as opposed to an always-running development environment.
Next, we're going to save and store our results, as well as our model binaries, in a Google Cloud Storage bucket, in a hierarchy of your choice, to version them as needed.
And, finally, ML Engine also has
a secondary module for serving.
This is where a model essentially automatically gets containerized and hosted as a Cloud endpoint for you to be able to access as a REST API, for instance-level, one-time predictions, maybe for a web or mobile application.
Or if you need batch predictions
for running through millions
or billions of rows
of a table, you
can run the batch
option as well.
Also, interestingly, ML Engine is a cloud-based solution--you have to be connected to it. RStudio Connect is actually a very similar offering that you can host on-premises on a server as well, and host some of those models there, where you might use our burst compute to train the models but then still want to serve them on-premises.
And then finally,
the API call can
be consumed by various
downstream applications,
whether you stay in the R ecosystem with R Shiny, or other mobile and web applications.
Let's dive a little bit now into the actual code for training on Cloud ML Engine. Again, we're just loading the cloudml library. cloudml_train is the main function, where you actually just point towards your TensorFlow code that's sitting in an R file.
There's a lot of
optional parameters,
and this is really
where the power is.
The master type, or the scale tier, allows you to access those resources for only the time that training takes. A standard GPU gives you access to an NVIDIA K80. We have a P100--that's also an option. But if you actually want to scale even further and have multiple GPUs on one machine, we have up to eight V100s on one virtual machine for a complex, large model.
So with quick plug and play, you really have access to various choices to optimize both price and performance when you're training models and thinking about this with a job mindset.
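For instance, a minimal sketch of that call; the script name and tier values here are illustrative.

```r
# Sketch of submitting a training job with the cloudml package;
# "train.R" is a placeholder for your model script, and the
# master_type values are examples of available tiers.
library(cloudml)

cloudml_train("train.R", master_type = "standard_gpu")   # one NVIDIA K80

# Or step up the hardware, for example:
# cloudml_train("train.R", master_type = "standard_p100") # one P100
```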
Next we have deploy and predict.
We're going to load the
Cloud ML library once again.
Once you've really
trained the model,
you're going to have a directory
where that model is saved.
You can export that model.
In this case, I'm
calling it "savemodel."
And then I'm going to deploy
and host it as an endpoint.
In this case, I'll
call it "keras_mnist."
Now at that point, you have
a lot of different options
from downstream applications.
But if you wanted to run
batch predictions or sample
predictions directly
in the R environment,
you can use the cloudml_predict
options to cycle through
and see whether the images
are being predicted correctly,
whether audio samples are
being transcribed correctly,
or any other traditional
structured data problems
that you may be solving.
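A hedged sketch of those deploy-and-predict steps; "savemodel" and "keras_mnist" follow the talk, and the sample instance is a placeholder.

```r
# Hedged sketch of the deploy-and-predict steps just described;
# "savemodel" and "keras_mnist" follow the talk, and the sample
# instance below is a placeholder, not a real MNIST image.
library(cloudml)

# Host the exported model directory as an ML Engine endpoint:
cloudml_deploy("savemodel", name = "keras_mnist")

# Instance-level prediction against the hosted model:
cloudml_predict(
  instances = list(list(image = rep(0, 784))),  # placeholder input
  name = "keras_mnist"
)
```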
So the great thing is
JJ Allaire from RStudio
was actually at
Next two years ago
introducing a lot
of these libraries.
So this has been around
for almost two years now.
So this is a rich ecosystem where RStudio actually has great resources, tutorials, galleries, and samples for you to take a look at, with various use cases across industries--everything from health care, to traditional time series forecasting, which gets a boost with deep learning techniques, as well as natural language. So please check that out. This is probably the most mature of the workflows currently out there.
So now I'll kick it off to
Chris to talk a little bit
about Spark with R on GCP.
CHRISTOPHER CROSBIE: Thanks.
So I am Chris.
I'm a product manager here at
Google Cloud for our open data
and analytics team.
And I'm here to talk
about how Spark,
combined with
Cloud Dataproc, can
help you scale your R analysis.
Now if you're not
familiar, Cloud Dataproc,
that is our managed
Hadoop and Spark service.
And many of you probably
hear the word "Hadoop"
and zone out instantly,
thinking, OK,
that old legacy technology.
And that's valid in some ways.
But Spark is still a very active
community for machine learning.
In fact, in two
weeks, right here,
there's going to be a
Spark AI conference.
And the reason why is because Apache Spark is a computing framework that lets you scale across a cluster of computers in a generalizable way. And that's perfect for building a rich ecosystem of ML libraries.
And the other great thing about Spark is it lets us run our R code.
And that means
that Cloud Dataproc
can scale your existing R
analysis without requiring
substantial code rewrites.
Cloud Dataproc is
often the fastest way
to move your analysis and
ML, with R, into the cloud.
So our goal with Cloud Dataproc--we want to let you take the open source tools, algorithms, and programming languages, like R, that you use today, but make it easy to apply them to cloud-scale data sets. And you can do that without having to manage clusters and computers.
Now there's three ways to
run R on Dataproc today.
There's RStudio, which
is the standard web
IDE that can be auto-installed
on the Dataproc master node.
And then there's two different
packages for running R
on Spark, sparklyr and SparkR.
And they have a lot
of similarities,
but there are some
differences that I'm
going to explain as well.
So sparklyr, that's a package
for working with R on Spark
that was developed by RStudio.
So you're going to see a lot
of great hooks and integrations
between RStudio and sparklyr.
SparkR, that's built directly into Spark itself. And so you're going to find great hooks there with things like Cloud Dataproc's jobs API. That jobs API makes it really easy to submit R jobs through things like the gcloud command or an HTTP endpoint, so you can build automated tooling around your R code.
But what I'd like to do now is just take a quick peek under the hood of both SparkR and sparklyr to help you understand how these tools actually scale your R analysis.
So this is a very high-level
SparkR architecture.
There's two pieces to Spark.
There's going to be the
Spark driver, and then
the Spark workers--
again, very high-level.
But oftentimes,
this Spark driver,
depending on what
mode of Spark you're
in, that's often going to run exactly where your RStudio is running.
And you're going to spin up
what's called a SparkR context.
You can think about that
essentially as a client
for calling into Spark.
Now, under the hood, what
happens in that SparkR context
is there's a bridge.
It's done over a
socket layer that's
going to translate your
R code into Java code.
And then there's a
Java Spark context
that's actually going
to scale your work
across the various
Spark executors.
You can think of
those, basically,
as nodes in the cluster
at a high level.
Now, what's going to happen from those Spark executors is they're going to actually make a call out to a local version of R. And that's real R. That means any package that you can use with R, you can scale with SparkR, because this is real R code that you're running once the work is broken up.
Now, an alternative architecture to that is sparklyr. What sparklyr does instead is serialize the information, using Apache Arrow, and use that information to call out to the spark.ml package. So that means, in terms of machine learning, you're going to be limited to the functionality that exists in the spark.ml package.
Now, despite that, there are some pretty cool features I found with sparklyr that are missing from SparkR. For instance, they do have a direct integration with CRAN. They have an XGBoost extension. And they also integrate with broom. If you're not familiar with broom, it's a pretty neat ML package. It takes messy, unstructured model output--like, if you run a bunch of t-tests and they come back with a bunch of unstructured information--and turns it into a nice, tidy data frame that lets you look at the results of multiple models at once pretty quickly. It's pretty interesting.
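A small sketch of that sparklyr-plus-broom flow; the yarn master and the mtcars example are illustrative, not from the talk.

```r
# Sketch of the sparklyr-plus-broom flow: push a local data
# frame to Spark, fit a spark.ml model, and tidy the result.
library(sparklyr)
library(dplyr)
library(broom)

sc <- spark_connect(master = "yarn")  # e.g. on a Dataproc cluster

mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)

tidy(fit)  # model coefficients as a tidy data frame

spark_disconnect(sc)
```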
So of course, this is going
to beg the question I'm
sure everyone has right now--
OK, what do you want me to
use, SparkR or sparklyr?
So SparkR, I find that folks
use it a little bit more
for automation.
First of all, it's built into Cloud Dataproc and into Spark itself, so you get a little bit better performance. But sparklyr, again, that's coming from the tidyverse folks on the RStudio team.
So that does tend to have a
little bit better integrations
with RStudio.
SparkR comes from AMPLab, the same people who did Spark.
So you're going
to find that they
are going to follow
things in a little bit
more Spark-first kind of way.
In other words, it's
going to follow, mostly,
the Scala way of doing things.
Whereas sparklyr, they think
a little bit more R-first.
And then sometimes they'll
do an R integration
at the sacrifice of
something like performance.
But both are active projects.
And so when there are examples
like that, what you find
is they tend to get
resolved pretty quickly.
So oftentimes it just comes down to where you want to spend your time. Automating R code? SparkR. Working with interactive, RStudio-type analysis? Then you'll go to sparklyr.
But if you're still a little confused on the difference and what to use, don't worry--I probably am too. Luckily, I found it's not really an issue to use both. Once you load the sparklyr package, you can then just call into SparkR. It does step on some of the dplyr functions, but as long as you handle the masking correctly, you can go ahead and use both.
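A minimal sketch of loading both in one session; the connection details are assumed.

```r
# Minimal sketch of loading both packages in one session;
# SparkR is loaded second and masks several generics.
library(sparklyr)
library(SparkR)  # masks filter, count, and other verbs

sc <- spark_connect(master = "yarn")  # sparklyr connection
sparkR.session()                      # SparkR session

# Qualify overlapping verbs explicitly to avoid surprises:
#   dplyr::filter(tbl, ...)  vs  SparkR::filter(df, ...)
```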
And a set of code that has both sparklyr and SparkR can be submitted to the Dataproc jobs API. So what I'd like to do now is quickly walk through an example of how you can use Cloud Dataproc to scale some R analysis.
And what I'm going to show
you is a common model,
like the one that you see here.
We're going to run
through a workflow
where the data starts in
BigQuery, which is often
a common store for
semi-structured or structured
data sets.
We're going to spin up an
auto-scaling Cloud Dataproc
cluster to actually do
our machine learning.
And then we're going to save
the output of that model
into Google Cloud Storage.
And that way, later on,
when I want to rerun this,
I'm going to use a Cloud Dataproc workflow template, which is basically a graph of jobs that you can send to Dataproc.
It'll spin up a right-sized
cluster to run those jobs,
run those jobs, make sure
that the cluster is torn down,
and then save all the
output of what happened
and all the logs
into Stackdriver.
So Mikhail, do you want to take
over on your laptop, flip over?
OK, go ahead and kick us off.
OK, so, I'm a product manager here at Google Cloud, so I spend a lot of my time building articles or blogs that I put out onto the internet.
And what I'd really love to have is a machine learning model that
learning model that
would read my blog and then
say, how successful is this
going to be on the internet?
And that might give
me an opportunity
to go back and tweak some
things before I actually go
and try to publish it.
So, whenever I'm looking
for data sets like this,
I always start with
BigQuery's public data
sets program that has over 180
different publicly available
data sets that are stored for
free that I can take and use
to augment my analysis.
And what I'm doing here is I found a great data set for Hacker News. The Hacker News data set has, basically,
a score, which is how well that
article did on the internet.
That score is a combination
of upvotes and relinks.
And it also has the full
text of the article.
So I can use that to train.
Now, what you're
seeing in this query
here is just I'm taking
scores greater than zero
and a length of text greater
than zero, only to get a nice,
clean data set.
Because what I'm
trying to do here
is-- although BigQuery
is awesome at SQL,
I wanted to take
advantage of R and some
of the machine learning
capabilities that Cloud
Dataproc can offer me.
So what I'm going to do is just put this into a table.
I could go directly
with a query,
but I'm just going to
make it easy on myself
and take a thousand-row
sample, and just
put it into this Hacker
News sample data set.
And you can see-- in the
bottom of the screen here,
you can see what
I'm getting back,
which is a score
that I'm going to use
for a label, along
with the full text.
So once that job has actually
completed, what I'm going to do
is I'm going to jump back
over to Cloud Dataproc.
And I'm going to
click on my cluster.
And I'm going to come
into a web interface.
Using our component gateway,
we expose various web UIs that
are running on your cluster.
And then we make sure that
you can come into those using
your cloud IAM credentials.
So you saw me just jump
into Zeppelin, which
is a Spark-based notebook.
But it also can work with R.
And so the first thing that
you're seeing me do here
is just-- that same
query I ran in BigQuery,
I could have run that
directly in Zeppelin.
There is an interpreter.
So I'm just showing that.
But what I really want to do is run this Python function that's going to let me take an arbitrary table, my Hacker News sample, and pull it into a Spark data frame.
Now once that's into
a Spark data frame,
I'm just going to
do a quick count
to establish that, hey, there's
a thousand rows in that sample.
That matches my limit, so I'm
probably on the right track.
Now what I want to do is register this data frame. And with one line, I'm going to register my data frame as a temp table.
And that's going to let me
work with this same exact data
frame across all the
languages of Spark.
So that's Scala,
SQL, Python, R. I
can go back and forth without
having to move infrastructure
or data around.
It's all in the same data frame.
And you can see,
now, that I can start
to do some of my
basic R explorations,
things that look completely
normal, like filtering on data
for scores greater than 50.
But I can now do this
on cloud-scale data sets
instead of just the R
that's running on my laptop.
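A sketch of those steps in SparkR; the data frame and view names are assumptions.

```r
# Sketch of the steps just shown, in SparkR; hn_df and the
# view name are assumed names for the Hacker News sample.
library(SparkR)

createOrReplaceTempView(hn_df, "hacker_news_sample")  # share across languages
count(hn_df)                                          # confirm the 1,000 rows

high_scores <- filter(hn_df, hn_df$score > 50)  # ordinary-looking R, run on Spark
head(collect(high_scores))
```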
Now there's two
ways I can actually
scale analysis using Spark.
First off, this is an example where I'm just going to send a GLM across Spark with a lot of different hyperparameters. So I can tune a lot of smaller models at once, get those all back in a list, and then analyze which model performed the best.
But in this example here, I'm actually not going to do that list apply. I'm going to do a dapply, because I just want to run one NLP package that I love across that whole Hacker News data set in Spark.
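Hedged sketches of both patterns; the data frame, its columns, and the toy word-count step stand in for the real demo code.

```r
# Hedged sketch of both patterns; hn_df, its columns, and the
# toy word-count step stand in for the real demo code.
library(SparkR)

# 1) spark.lapply: fit many small models, one per hyperparameter.
#    Assumes glmnet is installed on the workers and that the
#    (small) training matrices are captured from the driver.
lambdas <- list(0.001, 0.01, 0.1, 1)
fits <- spark.lapply(lambdas, function(lam) {
  library(glmnet)
  glmnet(train_x, train_y, lambda = lam)
})

# 2) dapply: run an arbitrary R function over each partition
#    of a Spark DataFrame, returning a new column.
scored <- dapply(
  hn_df,
  function(part) {
    part$n_words <- lengths(strsplit(as.character(part$text), "\\s+"))
    part
  },
  schema = structType(
    structField("score", "integer"),
    structField("text", "string"),
    structField("n_words", "integer")
  )
)
head(collect(scored))
```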
Now in this case, I just
have my thousand samples
that came back.
But this is enough
where I'm saying, yep,
I think I'm on the right
track with this model.
So what I want to
do now is I'm going
to go back up to
the function where
I was calling my sample table.
And I'm now going to change that to the full Hacker News data set--so the entire corpus, not just my sample.
And then just by
clicking this off,
I'm going to rerun all
that same code you just
saw, except this time, on the
full Hacker News data set.
And as this kicks off, what I want to do is come back into Dataproc and take a look at what's happening underneath the hood.
So I started with three VMs, which was totally fine for my thousand-row sample as I was manipulating the data and getting ready.
But about two minutes later,
on the larger data set,
as I'm pulling that
data in from BigQuery,
the Dataproc autoscaler
says, hey, you
need a little bit more help.
Let me add a couple
nodes to your cluster.
Then, a few minutes later, there's another spike in the YARN memory usage, which is because I'm actually now training on that text data.
So the cluster got even bigger.
But once that model
training completes,
I just refresh the page.
And the Dataproc autoscaler
has said, OK, I've
used all the memory I need.
And it scales my cluster
back down for me.
And so all of that was able
to occur without me ever
changing from the notebook.
I didn't change environments.
I just clicked Run, and the Cloud Dataproc autoscaler knew how to respond.
Finally, if I wanted to save this model off and not keep a running cluster, but still have it around, I could create a workflow template.
And I'm not going to step
through all of the steps here.
But essentially, it
would end up looking
like this in the console,
where, at any time, if I wanted
to rerun all this with
the associated cluster,
I would just click Run.
And that would be it.
All right, we can flip
back to the slide deck now.
So if you want to go and get started and do that exact same analysis, this code is what I used to spin up the cluster.
spin up the cluster.
It's pretty straightforward.
It's just I'm calling
into Cloud Dataproc.
I'm saying, use the
optional components
of Jupyter, Zeppelin, Anaconda.
In this example, you could
also throw in RStudio.
I used the Zeppelin, because
I want to go cross-languages.
But you could
definitely use RStudio
with this initialization action.
And then I installed a BigQuery connector and enabled the component gateway.
That's what exposed
those web UI links for me
and made sure that I had
permission to get into them.
And finally, the auto-scaling policy--there are a lot of knobs that we expose for tuning it. If you don't care about those knobs, you don't have to use them. But they are exposed in the settings of the autoscaler, if you want to control how aggressive it is or how often it scales.
So with that, I'm going
to hand over to Greg.
GREG MIKELS: All right.
Thanks, everyone, for
joining us this morning.
So we heard about some great
ways to leverage TensorFlow
and Spark ecosystems
using R and GCP.
But what about the cases where
you need to train and deploy
a model using any R library?
So let's look at an
end-to-end solution for this.
So first we need to set up
an R development environment.
And Cloud ML Notebooks
is a great tool
on Google Cloud for data
scientists to do this.
We can get started
with one click,
easily scale up with Compute
Engine machine types,
add or remove GPUs as
necessary, leverage both R
and Python from the notebook,
and enable team collaboration
with Git integration
and Cloud AI Hub.
We can publish notebooks
to Cloud AI Hub.
Specifically, for R users, we
can get started in four steps.
So we can launch a
PyTorch ML Notebook
which has Conda installed.
And this is going
to make it easier
to install our R packages and libraries.
We're also going to
load the rpy2 extension
to enable R cell magic.
And this is what
allows us to run both R
and Python in the same notebook.
We're going to create a
directory in our home folder
for libraries.
So we can install
our own libraries
and have write access
to that directory.
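A rough sketch of those steps, run inside %%R cells once the rpy2 extension is loaded; the library path is an arbitrary choice.

```r
# Run inside %%R cells after `%load_ext rpy2.ipython`; the
# library directory name here is an arbitrary choice.
dir.create("~/r-libs", showWarnings = FALSE)
.libPaths(c("~/r-libs", .libPaths()))
install.packages("caret", lib = "~/r-libs")
```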
But what should we do
in those situations
where resources on our
development workstation
aren't adequate, like
running out of memory when
working with a large data set?
So let's look at how we can
package R code in containers
for running on Cloud ML
Engine and Kubernetes Engine.
For those just getting
started with containers,
you can think of them as
a method for packaging
an app with its
dependencies for running
on a variety of platforms,
not just on ML Engine.
So in the next demo, I'll walk
through building a pipeline
based on this architecture.
We're going to import and
analyze our data with BigQuery.
So we're going to
use standard SQL
to query our data set, stage it
as a CSV file in Google Cloud
Storage.
We'll write a training
application with Jupyter,
with ML Notebooks
and R. And we'll
package this up with Docker
and send it over to ML Engine.
We'll write the app such
that the assets, the model
assets that are created
by the training app,
are stored in Google
Cloud Storage.
And we'll take these
assets and package them
in another container that
we can deploy for serving.
And this will leverage Google
Kubernetes Engine, seldon,
and s2i.
All right-- demo.
All right, great,
so first, we'll
navigate over to ML
Engine and ML Notebooks.
Well, we could create a new PyTorch instance, but I already have one created, so we're going to open this one to start.
I pulled some code from GitHub.
So I'm going to start here
with a notebook I've already
created.
I have Conda, and I've used Conda to install rpy2 already, so I'll go ahead and load the extension.
I've created a directory
for my libraries.
And I can go and check
this out, and see
that I've installed
some already,
and I have access to them.
And then I can use
the R cell magic next
to install whatever
library I need.
And then we want to
get data from BigQuery.
So we can use the
BigQuery cell magic.
And then our data
will be exported
into a data frame, which
we can then import into R.
So then, once it's in R, we can
view the structure of the data
that we've pulled in.
Here we're looking at public
baby weight data from BigQuery.
And we can also build a model--
just a traditional linear
model-- just to get started.
But we want to do more than that. So let's take this data
and stage it in Cloud
Storage so that we
can run some other models,
and we can scale it up
to ML Engine.
So I'm going to go into
the BigQuery console,
and take that query,
and paste it in there.
And this can be done from
within R using the API,
but it's just as easy
to use the console.
So first, I'm querying about
68,000 rows, just for testing.
So I'm going to
have two CSV files.
And I'll export this directly
to Google Cloud Storage.
And this is my storage bucket.
The small training data is going to be that file.
And then I'll change this query
to sample a bit more data.
And we'll use this data set
for when we scale to ML Engine.
So when this finishes,
we'll see that we
have about 13 million rows.
You can also see that
we're doing a little bit
of pre-processing in the query.
And it's good to keep
in mind that BigQuery
is very performant for
pre-processing data with SQL.
So this would be a
good case to use SQL
for some of that instead
of R, potentially.
So we'll export this
CSV to Cloud Storage.
And now I've got my data staged for training.
So I'll head back
over to Jupyter,
and I'll start working on
writing my training app.
So first, I create
this trainer folder
that you can see on
the left sidebar there.
And I'm going to create a
file for installing my R
dependencies.
First, I'm going to set
a library path based
on an environment variable.
And then I'll specify
the packages I need.
And then in the
training app, I'm
going to specify that
same library path.
I'm going to configure command-line arguments for the Google Cloud Storage training data path and for the export directory.
I'm going to use caret to
impute missing values, center
and scale my data.
I'm going to convert the features I need to factors,
partition the data
into test and train,
set up hyperparameter
tuning, cross-validation.
And then I can kick
off my training job
using glmnet for this example.
And when the
training's complete,
it's going to save my
model file, as well
as my pre-processing weights,
to Google Cloud Storage.
And I could also export
any additional assets
that I wanted, like
if I wanted to export
plots or other results.
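A condensed, hedged sketch of that training script; the argument order, paths, and the weight_pounds target (from the natality data) are assumptions, not the exact code.

```r
# Condensed, hedged sketch of the training script described;
# argument order, paths, and the weight_pounds target (from
# the natality data) are assumptions, not the exact demo code.
.libPaths(Sys.getenv("R_LIBS"))           # library path from env var
library(caret)

args <- commandArgs(trailingOnly = TRUE)  # [1] GCS data path, [2] export dir
data_path  <- args[1]
export_dir <- args[2]

system(paste("gsutil cp", data_path, "train.csv"))
df <- read.csv("train.csv")

# Impute, center, and scale the features (target excluded).
features <- setdiff(names(df), "weight_pounds")
pre <- preProcess(df[, features], method = c("medianImpute", "center", "scale"))
df[, features] <- predict(pre, df[, features])

idx      <- createDataPartition(df$weight_pounds, p = 0.8, list = FALSE)
train_df <- df[idx, ]
test_df  <- df[-idx, ]

fit <- train(
  weight_pounds ~ ., data = train_df, method = "glmnet",
  trControl = trainControl(method = "cv", number = 5),  # cross-validation
  tuneLength = 5                                        # hyperparameter grid
)

saveRDS(fit, "model.rds")
saveRDS(pre, "preproc.rds")
system(paste("gsutil cp model.rds preproc.rds", export_dir))
```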
Then I'll use Docker to package all of this up. I'll take the base R image. I'll install the Google Cloud SDK so I can download my files from Cloud Storage. I'll copy over that script to install packages, and set the library environment variable.
I'll copy over the training app.
And then I'll use entry
point to tell ML Engine, hey,
this is the script
that's going to run
when the container kicks off.
So before we go ahead and deploy
the container to Container
Registry, let's just make sure
the training app runs locally.
And then we'll get
the model file saved.
And then we'll also
test inference.
This is an end-to-end
deployment,
so we want to make sure
inference works, as well
as training, before we go
ahead and start training
on ML Engine.
So we see we get
a result returned.
So we'll go ahead and
build the Docker image
and package up our
training container.
And then we can deploy it to
Google Container Registry.
And once it's in
Container Registry,
it's accessible to ML Engine.
So we'll submit a job
referencing that container,
reference our training data in
the storage bucket, the output
directory.
And we can also
specify a machine type
or a cluster of machines here
with accelerators if necessary.
So when the job is submitted, we
can go back over to the console
and use ML Engine to
view the job running.
We'll be able to
see our utilization
metrics and our logs.
And you can add
additional logs as needed
based on your training app.
When the model is
complete, the model assets
are written to Cloud Storage.
So we can go into Cloud
Storage and see the assets
from the prior
jobs that have run.
So I've got a few folders
here from prior jobs.
And we'll look at
the most recent.
And here we can see
the model RDS file
and the pre-processing weights.
So then let's go
back into Jupyter.
And let's serve our
completed model.
So we're going to install
s2i, which is source to image.
It's a tool that's just going to make it easier to build the container image, essentially without a Dockerfile.
Source to image and seldon require this environment file. It also requires an install.R file, which installs the dependencies.
We're going to download the
model assets from Google Cloud
Storage and include
them in the container.
And then we're going to set up a runtime file, which is going to load the RDS model file and the pre-processing file. It's going to take the prediction input, do the pre-processing, and then evaluate against the model.
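In generic terms (the exact seldon/s2i wrapper conventions are omitted here), that runtime logic looks something like:

```r
# Generic sketch of the runtime logic just described (the exact
# seldon/s2i wrapper conventions are omitted): load the assets
# once, then pre-process each request and score it.
library(caret)

model <- readRDS("model.rds")
pre   <- readRDS("preproc.rds")

predict_weight <- function(newdata) {
  x <- predict(pre, newdata)  # apply the saved pre-processing
  predict(model, x)           # score against the trained model
}
```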
So once we have
all these files, we
can call s2i to build
our serving image.
So very easy--we don't even have to write a Dockerfile here.
We just build our
R files for serving
and then package up our image.
We can test it locally and
then run a request against it
locally.
We see we get what
we were expecting.
So we can also view logs if we
needed to debug, and then stop
the container to restart it.
So we'll go ahead and push
the working container that we
have to Container Registry.
And if we didn't have a
cluster already started,
we would create one,
a Kubernetes cluster.
And we can see that we
already have one running.
So we'll deploy the application
to the cluster we have running,
and then verify that
it's been deployed
and that the pod is running
for the application.
And then we'll expose the
deployment on port 80,
and then get an external
IP address for the app.
So we have our
external IP address.
We can make another request.
And we see the result
we were expecting.
So it looks like
we're good to go.
A nice thing with Kubernetes Engine, our managed Kubernetes service, is that we can go into the console and view the pod and the application running.
We can see utilization metrics.
And we can view logs from
the container as well.
So with that, I'll pass
it back over to Mikhail
to wrap up for us.
Thank you.
MIKHAIL CHRESTKHA: Thanks, Greg.
So just in closing,
hopefully this image
makes a little bit more sense.
At the very top, starting with the notebooks, we saw Chris talk about the new web interfaces module within Dataproc that allows you to quickly spin up these notebooks. ML Notebooks is actually a managed Jupyter service that also went beta on March 1--a new service that maybe a lot of you may not have heard of before. But we really made sure that both demos, as well as my reference architecture, showcased all three, so you got a sense of what each of those three looks like.
We also walked through
different storage techniques,
both from BigQuery
as well as GCS.
And from a Spark
perspective, I think
the auto-scaling is
something that I've really
been using quite a bit
myself as well as working
with a lot of customers.
Because there is no need to necessarily right-size your workloads, or have a little bit more capacity than you need sitting idle. That auto-scaling feature has the logic built in to grow and shrink the cluster as the workload demands.
And then on the
very bottom right,
what Greg walked
through, we'll make
sure to provide that in
terms of GitHub code.
But the key takeaway is,
again, hybrid and portable.
These are all containers.
You can use them
on Google Cloud.
You can train or create
a serving application.
But those are fully portable, able to be ported over to another cloud or to on-premises.
And seldon specifically,
the serving
approach that Greg
walked through,
that's part of the broader
Kubeflow ecosystem, where
we are building easy connectors to the Kubernetes ecosystem.
A quick cheat sheet around both
the services and libraries--
it is quite a bit.
And we'll try to get some
resources out there online.
But the first two
rows really talk
about the data
ingestion, GCP services,
as well as the R packages that
allow you to access these very
quickly, directly in your code.
We walked through different
ways of doing that within your R
code as well as
directly in the console,
whatever you're more comfortable
with-- various options there.
And then the bottom three talk
about the training service,
the serving service, and
then the actual R library
packages that you can
leverage to access
that in a familiar environment.
Now the last column, of course, does require a little bit of container knowledge--Docker, Kubernetes--but it really does future-proof you, and makes a lot of your R workloads portable moving forward.
So what now?
We have a couple
of bit.ly links.
For the Spark
ecosystem, Chris did
publish a blog post
a few months back
that's a great primer on how
to get running with Spark
on Dataproc.
Tensorflow.rstudio.com is a
great repository for resources.
And RStudio and JJ
actually partnered
with the creator of
Keras to publish a book,
"Deep Learning with R," as
well as the great cheat sheets
that RStudio is known for.
And then I mentioned
the examples
that Greg walked through,
a lot of great information.
That GitHub link is there under gmikels, his GitHub name, along with Google Cloud R examples, as well as some bit.ly links for a new feature within Machine Learning Engine for custom containers, and for Kubeflow and seldon.
[MUSIC PLAYING]
