[MUSICAL INTRO]
SERGEI SOKOLENKO:
Hello, everyone.
My name is Sergei Sokolenko.
I am a product manager
in the Google Cloud.
Today, with me, I have
Christopher Crosbie,
one of my PM colleagues,
and Ravi Upreti from Qubit.
And today, we're going to talk
about data processing options
on GCP: Hadoop,
Spark, and Dataflow.
So the session today
will be quite simple.
We're going to give
you an overview
of the options you have.
We're going to give
some recommendations
on what to choose.
We're going to do deep dives
into Dataflow and Dataproc.
And then we're going to
ask Ravi
to come up and tell us
the story of Qubit, and how
they use Dataflow and Dataproc.
So let's get started.
Data processing in the cloud--
Google Cloud-- is part of
a larger data analytics
offering set.
Starting with data
ingestion, you
have access to Cloud Pub/Sub,
which is our messaging
solution for streaming events.
If you need to move structured
or unstructured data
to the Google Cloud, you can
use the Data Transfer Service.
And if you need to
connect your IoT devices,
Cloud IoT Core is the
solution for that.
Now, once you ingest
the data, you have
options to process it.
And what are our options?
If you're building
a streaming solution,
and you prefer accessing a
state-of-the-art streaming
framework such as Apache Beam,
then Cloud Dataflow provides
a managed service that unifies
batch and stream processing.
If you're coming from the
Hadoop and Spark world,
would like to continue
using these tools,
and have an investment in
skills around tools
like Spark Streaming,
then Cloud Dataproc
offers a managed service.
And that's a great choice.
We also have a service
called Cloud Dataprep.
It's a visual UI tool
for data wrangling.
It's based on Cloud
Dataflow underneath.
It runs Dataflow jobs when
data needs to be crunched.
But to you, as
the end user, it's
exposed through
a user interface.
You don't have to do any coding.
Now, once you've ingested
and processed the data,
you have to store it somewhere.
So where do you store it?
You have two options.
You can store it in
BigQuery, our cloud data
warehousing solution,
or you can store it
in Google Cloud Storage if
you're building a data lake.
Once you've stored
your data, you have
access to yet more advanced
analytics products.
For example, you
can run predictions
using Cloud ML Engine.
Or you can visualize
your data using
Google Data Studio.
You can also do full-scale,
full-blown TensorFlow
implementations and run
them in self-managed mode
or with Kubeflow.
Or you can work in
spreadsheets
using Google Sheets.
Binding all of this together
in an end-to-end solution, we
have Cloud Composer,
a managed Airflow
service, which allows
you to put all these pieces
together in a working
end-to-end workflow.
Now, how do you choose
among these options?
Specifically, for
data processing,
there's a very
simple set of rules.
If you're modernizing
your data warehouse,
and you prefer the
ELT paradigm, where
you extract, load,
and then transform, doing
the processing
inside the data warehouse,
then BigQuery is the solution
that you should pick,
because it gives you rich SQL.
You can write your
SQL statements,
SQL scripts, solve
the problem that way.
If you're building a
streaming solution--
if you want
state-of-the-art streaming,
unification with batch,
exposure to Apache Beam--
then Cloud Dataflow is the
best managed service for you.
If you prefer Hadoop and Spark,
if you like Spark Streaming,
if you have an existing
investment in Spark
and Hadoop, then Cloud
Dataproc is a great way
to run open source software
in the Google Cloud.
Now let's do deep
dives into these services,
starting with Dataflow.
But before I do
the deep dive, let
me explain the differences
between Beam and Dataflow.
Several years ago, Google
open-sourced the Cloud Dataflow
SDK, and it became Apache Beam,
a top-level Apache project.
In 2018, last year, at
the end of the year,
the Apache Software Foundation
shared a couple of stats
about all of their projects.
And it turns out that
Apache Beam was,
in 2018, the top project
by [INAUDIBLE] activity,
and a top-three
project by commit activity.
So there's a lot of
excitement and developments
happening in the
Apache Beam world.
Why?
Well, users like
the fact that it
provides a simple and unified
batch and streaming API.
You don't have to
use different tool
sets for batch and streaming.
You can use one set
of tools-- so one API.
And portability-- that's
the other big benefit
of Apache Beam: portability
across runners.
For example, you can write
your Apache Beam pipeline
and run it in the Google
Cloud on Dataflow,
or you can run it
on Apache Flink,
either in a different
cloud or on premises.
In addition to the
runner portability,
Beam also envisions
portability across languages.
So imagine a pipeline that uses
some I/O connectors written
in Java, and some data
transformations written
by another team in
Python, and perhaps yet
another set of transformations
written by yet another team
at your company written in Go.
That's pretty powerful if
you're able to combine all
of these components together.
Now Cloud Dataflow uses
Apache Beam as its only SDK.
This is the SDK
that Cloud Dataflow
is offering to its customers.
It's a serverless, fully
managed data processing
option in the Google Cloud.
Together with Cloud
Pub/Sub and with Beam,
Dataflow offers exactly-once
streaming semantics.
What does it mean?
It means your messages
get processed only once.
Dataflow takes care of
error handling and data
deduplication.
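A minimal plain-Python sketch of the exactly-once idea just described: process each message only once by deduplicating on a unique message ID. This models the effect, not Dataflow's actual internals, and all names here are illustrative.

```python
# Illustrative sketch (not Dataflow's real internals): exactly-once
# effects via deduplication on a unique message ID.
processed_ids = set()
results = []

def process_once(message):
    """Apply the transform only if this message ID is new."""
    if message["id"] in processed_ids:
        return False  # duplicate delivery: already handled, skip
    processed_ids.add(message["id"])
    results.append(message["value"] * 2)  # the actual "processing"
    return True

# At-least-once delivery may redeliver: note the duplicate id=1.
stream = [{"id": 1, "value": 10},
          {"id": 2, "value": 20},
          {"id": 1, "value": 10}]
for msg in stream:
    process_once(msg)
```

Even though id=1 arrives twice, its value is only processed once, which is the user-visible guarantee described above.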
We also optimize the resources
used-- and not just resources,
but also the time that is spent
processing your data.
I'm going to explain,
a little bit later, how
the optimizations are working.
And lastly, Dataflow
offers separation
of state storage and compute.
And this separation allows us to
scale the different components
of your pipeline.
The place where the state
is stored, the place where
the calculations
are done, we are
able to scale them at
different rates, which improves
our autoscaling behavior.
So let me talk a bit about
the vision behind Beam.
And I will use an example
of a typical transformation,
summation of values
based on keys.
So if you want to do a group
by and sum all of the elements
per key, it's a very
typical operation.
The vision of Beam is to offer a
variety of such operations, not
just sums.
Summation is just
a simple example.
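The group-by-key-and-sum transform just described can be sketched in plain Python; in Beam's Python SDK, this is roughly what applying `beam.CombinePerKey(sum)` to a collection of (key, value) pairs does.

```python
from collections import defaultdict

# Plain-Python sketch of "group by key, then sum the values per
# key" -- the same operation Beam exposes as a built-in combine.
def sum_per_key(pairs):
    sums = defaultdict(int)
    for key, value in pairs:
        sums[key] += value
    return dict(sums)

totals = sum_per_key([("cat", 1), ("dog", 5), ("cat", 9), ("dog", 5)])
```

Beam's version of this runs the same logic in parallel across workers, which is why simple per-key combines parallelize so well.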
But all of the APIs
should be available
across different languages.
The languages that are supported
today include Java and Python.
And we have an
experimental Go SDK.
And sometimes I say "we"--
I sometimes mean Beam, and
sometimes mean Dataflow.
That's because a lot of folks
who are working on Beam
are also coming from Google.
But there is a large
community of non-Googlers
contributing to Beam.
Actually, the majority
of people working on Beam
are, at this point, non-Googlers.
So in addition to this
language portability,
the other part of
the vision for Beam
is the ability to run in
different environments.
Imagine you wrote
a pipeline, and you
decided to run it today in the
Google Cloud on Cloud Dataflow.
And the next day, you decide, well,
I could actually use locality,
and I could do some data
processing close to my data.
So why don't I run my
pipeline close to the data
in my on-premises environment
using a Flink cluster?
That's pretty powerful.
There's a variety
of different runners
being developed right now.
Flink and Cloud Dataflow
are the ones I would
use in production.
There are other environments,
or other runners,
that are being developed today.
Apache Samza is another one
that is pretty close to being
used in serious environments.
There's work being done on the
Spark runner and other runners
as well.
So now that we understand
the vision behind Beam,
let's talk about Dataflow.
Lots of our customers
like the fact
that they don't have to
understand this diagram.
For them, Dataflow
is a black box.
It connects sources and sinks.
What happens under the
hood doesn't matter.
For those of you who actually
do want to understand what's
going on, let me
just explain what's
happening in this black box.
First of all, Dataflow takes
care of resource management.
What do I mean by this?
I mean creating resources,
storage resources and compute
resources-- these are the boxes
in the middle of the diagram--
and scaling them up
and down separately
depending on how much
data remains to be processed,
and based on how
parallelizable this data is.
Some data is more parallelizable
for processing than other data.
And if we encounter a data set
that is easily parallelizable,
we will create lots of
resources that will crunch
through your data very quickly.
That actually saves you
money and processing time.
The other thing
that Dataflow does
is divide up data sets.
That's the work scheduling piece
and the dynamic work rebalancing
piece.
So we always keep track of the
small pieces of bits of data
that we have yet to process.
We assign them to
processing nodes.
We track how fast the
work is being done.
If we can improve
the execution time,
we will create new workers
and assign the data
to them, this way saving you
resources and execution time.
We also do some mundane things
like logging and monitoring.
And it turns out monitoring
is not that mundane.
Some of the improvements that
we've made over the past three
months were actually
about improving
monitoring capabilities.
Let's talk about what's new.
Everything here
has been launched over the
past three to four months.
So you're looking at
pretty fresh stuff.
We are continuing to invest
in our streaming capabilities.
We would like to be the
best-in-class streaming
solution.
One of the things
we launched recently
was the exposure of
two central metrics
for streaming pipelines, system
latency and data freshness.
Freshness, in other contexts,
is called the watermark.
So we are exposing them now
in your job details page
right there.
It's visible to you.
You don't have to
go to Stackdriver
to get these metrics.
And in addition to
seeing the statistics,
you can also create alerts
based on predefined thresholds.
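The alerting condition described here can be sketched very simply. As the demo later shows, the metric values are in milliseconds, so a 20-second threshold is 20,000 ms; the function name below is illustrative, not a Stackdriver API.

```python
# Sketch of a latency alert condition: fire once system latency
# exceeds a predefined threshold. Values are in milliseconds, so
# a 20-second threshold is written as 20,000.
THRESHOLD_MS = 20_000

def should_alert(system_latency_ms):
    """True once the pipeline's system latency crosses the threshold."""
    return system_latency_ms > THRESHOLD_MS

alerts = [should_alert(ms) for ms in (4_000, 19_999, 25_000)]
```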
We released streaming
support in the Python SDK.
And we worked with
the Beam community
on creating a production-quality
Beam on Flink runner.
If you are interested in
learning more about how Lyft
is using the Beam on Flink
runner in their environment,
come join us on
Thursday at 11:40.
Thomas Weise from Lyft will
be co-presenting with me
about Lyft's usage
of the Beam runner.
In addition to
improvements in streaming,
we've also done a bunch
of improvements
on our service side, starting
with a redesigned UI
for creating jobs.
We made it very simple
to create single-source
to single-sink ETL jobs.
We now offer about 20-plus
different ETL templates.
For example, you can
move data from Bigtable
to GCS storage or import data
from GCS storage into BigQuery.
Lots of combinations
of sources and sinks
come for free when
you use these templates.
We developed and released a new
connector for BigQuery's Fast
Storage API.
We improved our support
of your security--
by, for example, not requiring
you to use public IPs anymore.
You can use private IPs
for your Dataflow workers.
Very recently, we
launched support
for VPC Service Controls.
This helps you prevent
data exfiltration threats.
And lastly, our batch
customers like the fact
that they can join and group
hundreds of terabytes of data
using the Dataflow Shuffle.
Let me show you a
quick demo of how easy
it is to create a new streaming
job that dumps all of the data
from a Pub/Sub topic
into GCS storage,
in effect, creating
backups of streams,
and then creating alerting
for these streaming pipelines
based on predefined conditions.
Demo, please.
All right, so I am currently
in the Pub/Sub browser.
I have a Pub/Sub topic
called "Transactions."
And this topic has a
never-ending stream
of sales transactions, things
like who purchased the good,
the good itself, the sale
amount, the place of sale,
the time of sale.
So what I want to
do is create
backups of data in this topic.
And I can do it very easily
now by going to one
of the contextual menu options.
Not only can I import, I
can actually export as well.
I can export into BigQuery, or
I can export into Cloud Storage.
I'm going to choose the
other option, Cloud Storage.
And within a few
seconds, a new screen
will pop up that will allow
me to complete my flow.
All of the relevant
values are prepopulated.
For example, the
Dataflow template
that implements this
flow is prepopulated.
There are many, many more.
Just look at this list.
I could have exported and backed
up a Cloud Spanner database
into storage.
Or I could have exported
data from
Bigtable into storage.
But I'll stick with my
predefined template, my Pub/Sub
to text files,
and just finish up
typing in the name
of the destination.
This is the bucket where all of
my Pub/Sub transactional data
will be stored.
I'm going to give every file a
prefix so that I don't lose it
in the output bucket.
And I'm also going to type the
name of the temporary folder.
And I will actually
not forget, this time,
to change the name of my job,
because during pre-demo prep,
I created several jobs
with the same name.
All right, within
a second or two,
Dataflow will populate
the execution graph
of this pipeline.
It has three steps.
The first step is
reading from Pub/Sub.
Then this template will
window all of my data elements
in the stream into
five-minute chunks.
And each of these chunks
will be, then, stored
in Cloud Storage.
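The windowing step just described can be sketched in plain Python: each event timestamp is assigned to the fixed five-minute window that contains it, and events in the same window are grouped for output together. This is a conceptual model of fixed windowing, not the template's actual code.

```python
# Sketch of the fixed-window step: assign each event timestamp
# (seconds since epoch) to the five-minute window containing it.
WINDOW_SECONDS = 5 * 60

def window_start(event_ts):
    """Start of the fixed five-minute window containing event_ts."""
    return event_ts - (event_ts % WINDOW_SECONDS)

def window_events(events):
    windows = {}
    for ts, payload in events:
        windows.setdefault(window_start(ts), []).append(payload)
    return windows

# Events at t=0s and t=299s share one window; t=300s opens the next.
chunks = window_events([(0, "a"), (299, "b"), (300, "c")])
```

Each resulting chunk then maps to one output file in Cloud Storage.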
It takes a couple of minutes
to populate enough data
to enable my latency graphs.
Let me save the time and
just pop into another job
that I launched before the demo.
Here's how it looks after
a couple of minutes.
I now see that my pipeline
has about a couple
of seconds, three, four,
five seconds of latency.
Latency measures the
time between when
an event enters the pipeline
and when it exits the pipeline.
And I have
appropriate statistics
for freshness as well.
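The two metrics just described can be written down directly. System latency is the time between an event entering and exiting the pipeline; data freshness tracks how far the oldest not-yet-processed event lags behind "now" (the watermark idea). These formulas are a simplified illustration of the concepts, not Dataflow's exact definitions.

```python
# Sketch of the two streaming metrics.
def system_latency(entered_at, exited_at):
    """Time an event spent inside the pipeline (same time units)."""
    return exited_at - entered_at

def data_freshness(now, pending_event_times):
    """Lag of the oldest event still waiting to be processed."""
    if not pending_event_times:
        return 0
    return now - min(pending_event_times)

latency = system_latency(entered_at=100, exited_at=104)
freshness = data_freshness(now=200, pending_event_times=[188, 195])
```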
So what if I want to define
an alert for this pipeline
where I want to get notified
if the latency of my system
exceeds 20 seconds?
This is very easily
done now by clicking
on this Create alert link,
where I will be rerouted
to Stackdriver monitoring
with a bunch of values already
pre-populated.
I'm going to keep most
of the filters the same.
Yes, I want to
monitor Dataflow jobs.
And the metric that I want
to monitor is system lag.
And this filter
also works for me.
Job name equals my job name.
What I do want to
change is the threshold.
The values are in
milliseconds, so for
20 seconds, I actually
have to type in "20,000."
Once the metric
reaches that value,
I'm going to get an alert.
And let me just save this alert.
And voila, I've got an alert.
I'm going to get
an SMS or an email
whenever my pipeline gets slow.
Back to slides, please.
So what else is new?
There are two other areas where
we've done good improvements.
And I actually wanted
to quickly explain
what all of these
new products mean--
TensorFlow, Kubeflow
Pipelines, et cetera.
We are continuing to invest
in our machine learning and AI
support.
TensorFlow Extended
is an open source machine
learning platform for production
use at serious scale.
Kubeflow is a Kubernetes-based
machine learning platform.
It works with TensorFlow, but it
also supports other frameworks.
And Kubeflow Pipelines is the
technology part of Kubeflow
that creates infrastructure for
data processing for Kubeflow.
One thing we improved
over the past two
to three months:
we created an integration
point with Kubeflow Pipelines.
So you can now do
feature preprocessing
using Kubeflow Pipelines that
will spin up a job in Dataflow.
Dataflow and Beam are
powering TensorFlow Extended.
All of the major data processing
tasks of TensorFlow Extended
are running on Apache Beam.
And by extension,
they can run either
in the Google Cloud on Dataflow,
or they can run on Flink.
And lastly, we keep investing
in our Python support as well.
For example, we recently
released a Bigtable Python
connector.
We launched support
for Python 3.
And I already mentioned
our streaming capabilities
that we released in beta.
And with this, I
would like to invite
Chris to talk more about Hadoop
and Spark in the Google Cloud.
CHRISTOPHER CROSBIE: Thanks.
Hey, everybody.
I'm Christopher Crosbie.
And I am a product manager on
our open data and analytics
team.
And so I'm here to talk about
how Spark and Hadoop, combined
with Cloud Dataproc, can
help you do processing
on cloud-scale data sets.
So our goal with
Cloud Dataproc, it's
to let you take the open
source tools, algorithms,
and programming languages
that you're using today,
but apply them to
cloud-scale data
sets while, at the
same time, helping
you integrate with the rest
of the Google Cloud ecosystem.
So I think we've
established this already,
but Cloud Dataproc is Google
Cloud Platform's fully managed
Apache Spark and
Apache Hadoop service.
And while I'm saying
Hadoop and Spark,
it's really an engine
for running
open source software
in that ecosystem.
We offer customizable
machine types,
which means that if you have a
set of machine learning jobs,
those could live on
one cluster that's
very compute-intensive,
while, at the same time,
you can have a set of
BI or ad-hoc analysis
applications that live on
a memory-heavy cluster.
The two never have to
contend for resources,
but they could both be pointing
at, reading from, and writing
the same data sets in
Google Cloud Storage.
And we offer you a lot of
tools and features specifically
designed for managing
those types of architectures,
and ephemeral Hadoop and Spark
clusters in general.
We also give you
tight integration
with the rest of
Google Cloud Platform.
So this is not just us giving
you yet another cluster
that you have to manage.
We are going to let you run your
existing open source software,
but still modernize your stack.
Dataproc also gives you
a lot of flexibility
and a lot of different
knobs you can turn.
Now we do our best to
create some sane defaults
for those knobs,
so it's really easy
to just get going
if you don't care.
But because we do
expose those knobs,
we find that
customers use Dataproc
in a couple different ways.
The first way is what we refer
to as job-scoped clusters.
And this is really efficient for
batch- and ETL-type processing.
In this mode, essentially,
what you can do
is have a single command
that sends a graph of jobs
to Cloud Dataproc.
We will spin up a right-sized
cluster, run those jobs,
make sure that your
cluster's torn down,
and then save all that
information into Stackdriver
to make sure that you have a
full record of everything that
happened while that
cluster existed.
And because Cloud
Dataproc can usually
give you a fully loaded Hadoop
cluster in around 90 seconds,
this is a really effective
model for starting
to think about jobs and
clusters as a single entity.
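The job-scoped cluster lifecycle just described can be sketched as a single sequence of steps: create a right-sized cluster, run the graph of jobs, tear the cluster down, and persist the logs. The function and step names below are illustrative, not the Dataproc API.

```python
# Conceptual sketch of the job-scoped cluster model: the cluster
# exists only for the lifetime of its graph of jobs.
lifecycle_log = []

def run_job_scoped(jobs):
    """Treat the jobs and their cluster as one entity."""
    lifecycle_log.append("create right-sized cluster")
    for job in jobs:
        lifecycle_log.append(f"run {job}")
    lifecycle_log.append("tear down cluster")
    lifecycle_log.append("save logs to Stackdriver")

run_job_scoped(["etl-step-1", "etl-step-2"])
```

In practice this whole sequence is what a single workflow-template submission drives for you.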
And so customers have a lot
of success with this model.
And we have a lot of features,
like our Jobs API, workflow
templates, and cluster
scheduled deletion,
that help Dataproc
customers with this model.
Now we also have
customers that tell us
they have scenarios for
semi-long-running Cloud
Dataproc clusters.
Examples of this
might include things
like a shared cluster for
interactive or ad-hoc analysis,
often with web notebooks
like Jupyter or Zeppelin,
or BI applications built
on tools like Druid.
And we have plenty
of features that
help with this as well
in Cloud Dataproc, things
like our high availability
mode or our autoscaler.
And if you want to
learn more, I actually
did an article a couple
months back on my top 10 tips
for running long-standing
Dataproc clusters.
There's a webinar
associated with that.
So there's a lot of materials
out there on the web
already if you
want to learn more
about how to go and
use Cloud Dataproc
for long-standing clusters.
But regardless of which
model you choose,
Cloud Dataproc has some pretty
big differentiators
compared to other cloud
providers and the competition
here.
So with our features like
autoscaling and workflow
templates, we make
it really easy
to manage both ephemeral
and long-running clusters.
We have both a high
availability mode
as well as strong
consistency on our storage back
end, which is Google
Cloud Storage.
And we even received
external validation
from an analyst firm, ESG,
that Cloud Dataproc's TCO
is less than half that
of on-prem clusters,
and even 32% less than some
of our cloud competition
like Amazon EMR.
So just to give you a feel
for some of these features,
I'm going to quickly jump into
a demo that's going to show off
some of Dataproc's features.
Actually, you can go back--
there was a video on the last
slide in the slide deck.
There we go.
We'll just leave it
like that for now.
But essentially, the
motivation for this demo is
that, as a product
manager for Google Cloud,
I find I spend a
lot of time writing
blogs and articles that I
put out on the web.
And what I'd love to have
is a machine learning
model that tells
me whether this is going
to be a good article or not.
Are people going to like this?
And so whenever I have
questions like that,
the first place I
always tend to go
is our BigQuery public
data sets program,
which has over 185
different public data
sets that I can take and
use to augment my analysis.
And so the public
data sets program
actually had a perfect data
set for my use case here.
It has a Hacker News data
set of every article posted
on Hacker News, associated
with a score of how
well that article
did on the internet.
That score is a combination
of upvotes as well as relinks,
and there's a couple
of other things
that go into that algorithm.
But essentially,
it's a label that I
could use for my machine
learning application associated
with the full text
of an article.
And so what I'm doing
here is I'm just
going to make sure I have
scores greater than zero
as well as a length of
text greater than zero.
And that way, I'm just
cleaning up the data
to make it a nice data
set that I can process.
And I'm going to take just the
first thousand so I can better
understand how to
work with this data.
And you can see, what comes
back here is a set of data
that I can use to do
this machine learning.
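The cleanup step just described (score greater than zero, non-empty text, first thousand rows) can be sketched in plain Python. In the demo itself this is done with BigQuery SQL; the field names below just mirror the description.

```python
# Sketch of the data-cleaning step: keep rows with a positive
# score and non-empty text, then take the first thousand.
def clean_sample(rows, limit=1000):
    kept = [r for r in rows if r["score"] > 0 and len(r["text"]) > 0]
    return kept[:limit]

rows = [
    {"score": 12, "text": "a useful article"},
    {"score": 0, "text": "no votes"},       # dropped: score == 0
    {"score": 3, "text": ""},               # dropped: empty text
]
sample = clean_sample(rows)
```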
Now BigQuery is awesome for SQL.
But I'd really love to take
this out into Cloud Dataproc
so I could do some things
in R, and with Spark machine
learning, and some other things.
So what I'm going to
do here is I'm just
going to save this thousand-row
sample as a Hacker News sample
data set.
And then I'm going to jump
back into Cloud Dataproc.
As soon as this
job completes, I'm
going to jump into
Cloud Dataproc.
Now within Cloud
Dataproc, what we do is,
through our
component gateway, we
expose the web UIs
for components that
are running on your cluster.
So because of my
cloud IAM permissions,
I can jump right into
a Zeppelin notebook,
which is a notebook
for working with Spark.
And the first thing I'm going
to do here is just show you
that that query I
just ran in BigQuery,
I could have run that
directly in Zeppelin.
There is an interpreter for it.
But what I really
want to do here
is I'm going to pull that
data into a Spark data frame.
So I have this Python function
that accepts an arbitrary table
name.
And I'm going to pass it
my Hacker News sample.
And just like that, I've now
converted that BigQuery table
into a Spark data frame
that I can work with.
And so I just want to do a
quick check on the counts
and double-check that, yes, I
do have a thousand rows coming
back.
I'm good to go.
And the first thing
I want to do now is,
with one line of code,
I'm going to take that
data frame and register it
as a temp table.
What that lets me do
is work across all
of the languages of Spark.
Whether it be Scala,
SQL, Python, R,
I can now just move
interchangeably
between languages against
that same data set just
by registering that table.
And you can see, now,
I've flipped into R.
And so I was in Python.
Now I'm using R against
that same data set.
And all the familiar
things I would
do in R, things like a quick
filtering, and aggregations,
and transformations, I can now
do those on cloud-scale data
sets without having to
learn something new.
Now, there are two ways I
can start to scale my
analysis across Spark.
So first of all, I
could do something
like a list apply
in R, where I could
take several models, maybe
with different hyperparameters,
send those all across Spark
to be trained, and get
the results back.
And then I could pick
the model that had
the best hyperparameter tuning.
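The "train several candidates, keep the best" pattern just described can be sketched in plain Python. In SparkR this fan-out would typically use `spark.lapply` over hyperparameter values; the toy `train` function below is purely illustrative, standing in for real model training.

```python
# Sketch of parallel hyperparameter search: train one model per
# candidate value, then keep the one with the lowest error.
def train(lr):
    # Toy stand-in for training: pretend validation error is
    # minimized at a learning rate of 0.1.
    return abs(lr - 0.1)

candidates = [0.01, 0.1, 0.5]
errors = {lr: train(lr) for lr in candidates}
best_lr = min(errors, key=errors.get)
```

On Spark, each `train` call would run on a different worker, so all candidates are evaluated at once instead of one after another.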
But what I'm actually
going to show here
is another technique,
with dapply, where
I'm just going to run an NLP
algorithm across the full data
set.
And that full data set here
is only a thousand rows.
But it does give me enough of
an indicator that, yeah,
I'm on the right
track with this model.
I want to see how it looks
if I train
against the full data set.
So the idea here is I could
send it some text of my own
and get back a score for how
that's going to do
on the internet.
So all I want to do now to
train the entire model is
I'm going to change my table
name in the same exact notebook
from "Sample" to "Full."
And then I'll just
go ahead, and I'm
going to rerun that
entire notebook.
All the code
steps you just saw,
I just rerun, but now on
the full Hacker News data set.
So as that's running,
what I want to do
is I'm going to flip back over
to Cloud Dataproc's console
just so I can show you what's
happening underneath the hood.
So you can see, that
whole time, I was in
a three-node-- or three-VM--
cluster, which
was totally fine for working
with my thousand samples.
But in about two
minutes, what happened
is my YARN pending
memory spiked.
And that's because I'm
pulling in that BigQuery data.
And so now the
Dataproc autoscaler
went ahead and added
two nodes to my cluster
to help me out there.
But then what happens
is, you can see,
there's another spike
in YARN pending memory.
And so the Dataproc
autoscaler has now
spun up a full-sized cluster
to help me process that.
Once the model
training completes,
I can simply hit
Refresh here just
to show that the
autoscaler is now
going to take away those nodes,
because I don't need them
anymore-- the model
training completed.
And you can see where
there was a spike.
The memory wasn't
needed anymore.
And that's why the autoscaler
took down those nodes for me.
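The autoscaling behavior just shown can be sketched as a simple decision rule: scale up while YARN pending memory spikes, and scale back down once the backlog drains. The thresholds, node counts, and function name below are made up for illustration and are not the actual Dataproc autoscaler policy.

```python
# Sketch of the autoscaler's decision: grow on pending-memory
# pressure, shrink back to the minimum when the backlog is gone.
def autoscale(current_nodes, pending_memory_mb,
              scale_up_at=1024, min_nodes=3, max_nodes=10):
    if pending_memory_mb > scale_up_at and current_nodes < max_nodes:
        return current_nodes + 2   # add workers to drain the backlog
    if pending_memory_mb == 0 and current_nodes > min_nodes:
        return min_nodes           # work done: shrink back down
    return current_nodes

# Two memory spikes grow the cluster; an empty queue shrinks it.
sizes = [autoscale(3, 4096), autoscale(5, 8192), autoscale(10, 0)]
```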
And finally, if I wanted to save
all of this off without having
to keep a running cluster, but
still be able to retrain my
model at any time-- or hand it
my next article and run
that algorithm against it--
I could simply use
a workflow template.
And what that looks like
here in the console, when
I go to workflow templates,
is I can simply click Run
after I've set that up.
And that will spin up a
right-sized cluster for me
that will run those jobs and
then take down that cluster.
And so with Cloud Dataproc,
we are really trying to
make open source software
cloud-native.
This means building
features that are fast, easy,
and cost-effective.
And so we launched more than
30 features last year alone.
So this is still an area
Google is heavily investing in.
If you want to know more
about specific features,
we're going to do an
entire Dataproc features
session tomorrow at 11:00.
We'd love to see you there.
But for now, I'm
going to hand over
to Ravi, who's going to talk
about how Qubit's actually
using some of
these features now.
[APPLAUSE]
RAVI UPRETI: Thanks, Chris.
That was a great demo.
Autoscaling in
action-- really good.
Hi, everyone.
I'm Ravi Upreti.
I'm the platform
engineer lead at Qubit.
At Qubit, our vision is
to have a world where
every customer can have
a personal experience
with the brands they love.
We want to create
the greatest customer
experience in milliseconds.
Our mission is to drive
customer loyalty and lifetime
value through personalization.
We work with
enterprise customers
across three main verticals--
e-gaming,
travel, and retail.
In order to power
these personalizations,
we collect and process a
lot of behavioral data.
Let's have a look
at some numbers.
So at Qubit, we process
more than 120,000 events
per second at peak times.
We have petabytes of data
stored across thousands
of BigQuery tables and other
Google Cloud Storage solutions.
We power more than 55 billion
personalizations per month.
That is around eight per
person on the planet.
Google Cloud enables
us to operate
at that scale with ease.
Let's consider a
personalization use case,
where a
visitor is looking at a product
page, and you want to show
some recommendations along
with that product.
So how do we go about it?
It all starts from a user
browsing to the site,
looking at different categories,
at different products.
And as the user journeys
through the site, we
generate a lot
of behavioral data,
which we collect and process.
The architecture here shows
what a pipeline that solves
this use case would look like.
We will revisit
this architecture.
But this time, step by
step, we'll follow an event
from start to end as it
travels through Qubit's systems,
and see how we use
the technologies
relevant to this session--
where Dataflow and Dataproc fit
into the picture.
The event that we
are interested in here
is a
product event-- somebody
looking at a product page.
The very first thing that
happens is the user's browser--
or a mobile app--
triggers the event.
It enters Qubit via
a Qubit gateway,
where it is stamped with some
meta-information and moves
forward to our very
first data flow--
that is, the
validation data flow.
We, internally, call
it the gatekeeper.
As I'm talking about the first
data flow in the architecture,
let me share our
experience with Dataflow,
how we manage deployments,
and how we monitor them.
Our experience with
prototyping ideas with Dataflow
has been really good.
Since most of our code was
in Java for the old pipeline,
it was really easy for us to
actually try those things out
in Dataflow.
Now, looking at
the Beam version,
obviously, there is
support for more languages.
So I am sure it will be
easier for many of you here.
We were, in fact,
able to move really
fast, and iterate and validate
ideas really quickly.
In fact, it was so easy to
code and run these data flows
that it almost gets you into
a mindset called "data flow
thinking," where you want
to solve every problem with
a data flow.
It is that easy.
In terms of deployment, we
have our own in-house Dataflow
launcher that we use to
deploy these data flows.
It lets you seamlessly
deploy data flows.
You don't have to
worry about bots.
You don't have to worry
about options, the plumbing
code around creating
Pub/Sub subscriptions,
and things like that.
By the way, we have
open-sourced it.
So if you look on GitHub, you
should be able to find it.
As you can see here, we
also have schema registry.
So obviously, in
this journey, we
are following the product event.
But in reality,
we have thousands
of different events, which
have their own structure,
own customizations per
client, and things like that.
In order to manage that, we
have our own web service,
which is schema registry, that
helps us manage those schemas.
I'll quickly talk about
monitoring as well.
So we use a mix of Stackdriver
and in-house Prometheus-based
monitoring.
We started with Stackdriver.
And it works really well-- it
has worked really well for us.
We use it for monitoring things
like undelivered messages
in Pub/Sub, or the system lag,
and things like that.
But we have some
unique requirements,
where we are interested in
not just the freshness of all
the events, but the freshness
of this product event
for a particular client.
That's why we host our
own monitoring as well.
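Per-client freshness tracking of that kind can be sketched simply. This is a minimal illustration of the idea, not Qubit's monitoring stack; the class and its method names are invented.

```python
# Minimal sketch of per-(client, event type) freshness tracking,
# as opposed to a single global "events are flowing" signal.

class FreshnessTracker:
    """Remember when each (client, event_type) pair was last seen,
    so staleness can be alerted on per client, not just overall."""

    def __init__(self):
        self.last_seen = {}

    def record(self, client: str, event_type: str, ts: float) -> None:
        key = (client, event_type)
        self.last_seen[key] = max(self.last_seen.get(key, 0.0), ts)

    def staleness(self, client: str, event_type: str, now: float) -> float:
        """Seconds since this client's event type was last seen."""
        ts = self.last_seen.get((client, event_type))
        return float("inf") if ts is None else now - ts

tracker = FreshnessTracker()
tracker.record("acme", "product_view", ts=100.0)
```

A monitoring loop would periodically call `staleness` for each client and page someone when it crosses a threshold.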
So, moving on: the
validation data flow
looks up the schema
from the schema registry,
validates the event, and,
if it's all good and legit,
moves it forward to our
next data flow, which
is the enrichment data flow.
For this event, say,
we want to know
whether it was actually
triggered by a browser
with human interaction
or whether it was a bot.
Things like bot detection
are what we do
in the enrichment data flow.
We also do things like
currency conversion, time zone
conversion, geolocation,
and things like that.
Basically, it adds a bit of
a golden touch to the event.
Internally, we call it--
no points for
guessing--Midas.
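An enrichment step like the one described can be sketched as a pure function over the event. The user-agent heuristic and the exchange rates below are illustrative only--real bot detection and currency handling are far more involved.

```python
# Hypothetical enrichment step: naive bot detection plus currency
# conversion. Rates and heuristics are made up for illustration.

RATES_TO_USD = {"USD": 1.0, "GBP": 1.25, "EUR": 1.10}  # made-up rates

def enrich(event: dict) -> dict:
    enriched = dict(event)  # never mutate the incoming event
    ua = event.get("user_agent", "").lower()
    # Substring check is a toy heuristic; production bot detection
    # looks at behavior, not just the user-agent string.
    enriched["is_bot"] = any(m in ua for m in ("bot", "spider", "crawler"))
    if "amount" in event:
        rate = RATES_TO_USD.get(event.get("currency", "USD"), 1.0)
        enriched["amount_usd"] = round(event["amount"] * rate, 2)
    return enriched

out = enrich({"user_agent": "Googlebot/2.1",
              "amount": 10.0, "currency": "GBP"})
```

Keeping each enrichment a side-effect-free transform is what makes it easy to express as a Dataflow/Beam `ParDo`.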
After enrichment, this
event is pushed forward
to our next data flow.
This is the
persistence data flow.
The persistence
data flow makes sure
that this particular product
event enters the right table
in the right data set.
As you can see, we have
schema registry doing a cameo
appearance again here.
But this time, it's
responsible for giving us
the BigQuery-specific
schema that we use
to translate our product
event into a table row type,
and for looking up
meta-information like
the partitioning columns for
BigQuery and things like that.
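The persistence step's translation can be sketched like this. The schema shape, field names, and partition column are all hypothetical stand-ins for what the schema registry actually returns.

```python
# Hypothetical translation of a product event into a BigQuery-style
# row, driven by a registry-supplied schema. All names are made up.

BQ_SCHEMA = {
    "fields": {"client_id": str, "product_id": str, "amount_usd": float},
    "partition_column": "event_date",
}

def to_table_row(event: dict, schema: dict) -> dict:
    """Keep only schema fields (coercing types) and add the
    partition column derived from the event timestamp."""
    row = {
        name: typ(event[name])
        for name, typ in schema["fields"].items()
        if name in event
    }
    row[schema["partition_column"]] = event["timestamp"][:10]  # YYYY-MM-DD
    return row

row = to_table_row(
    {"client_id": "acme", "product_id": "sku-42", "amount_usd": 12.5,
     "timestamp": "2019-04-10T12:00:00Z", "extra": "dropped"},
    BQ_SCHEMA,
)
```

Fields the schema doesn't know about are dropped on the way in, which keeps the BigQuery tables stable as events evolve.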
This completes our
event's journey:
starting from the
browser, traveling
through our various data flows,
and ending up in BigQuery,
our data warehousing solution.
A bit of a detour here--
so we, at Qubit, extensively use
Dataflow in batch mode as well.
One use case, very similar
to the one we've been
talking about: suppose
a client has in-house
generated recommendations,
and they want to merge
Qubit's recommendations
with their own to create
more powerful personalization
experiences.
Here, as you can see,
an automated system
creates recommendations
and pushes them into Qubit,
onto GCS.
We have services listening
for these object change
notifications via Pub/Sub.
We spawn an on-demand data
flow that reads the file,
can autoscale if
the file is big,
and finally pushes the data to
Bigtable, and to BigQuery
as well.
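GCS object change notifications delivered via Pub/Sub carry the event type and object location in the message attributes. A minimal handler might look like this; the bucket and object names are placeholders, and the dispatch function is our own invention.

```python
# Sketch of reacting to a GCS object change notification from
# Pub/Sub. Attribute names (eventType, bucketId, objectId) follow
# GCS's Pub/Sub notification format; the rest is illustrative.

def handle_notification(attributes: dict):
    """Return the gs:// path to process when a new object lands,
    or None for events we don't care about."""
    if attributes.get("eventType") != "OBJECT_FINALIZE":
        return None  # ignore deletes, metadata updates, etc.
    return f"gs://{attributes['bucketId']}/{attributes['objectId']}"

path = handle_notification({
    "eventType": "OBJECT_FINALIZE",
    "bucketId": "qubit-recs-inbox",          # placeholder bucket
    "objectId": "client-a/recs-2019-04-10.json",  # placeholder object
})
```

In Qubit's setup, the returned path would be handed to a templated batch Dataflow job, which autoscales to the file size.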
A quick note on the unified
programming model--
we really love the
unified programming model.
Running the same code
base in streaming mode
and in batch mode
is really good.
Think of a use case
like migration, where
you want to keep
the same code base.
Say your pipeline currently
reads from Pub/Sub,
but you want to migrate data
from, let's say, a running
Kafka queue into your
new pipeline--
all you have to do
is just switch the
sources around.
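The source-swapping idea can be shown in miniature without Beam. The fake sources below stand in for Beam's Pub/Sub and Kafka read transforms; the point is that the pipeline logic is written once against an abstract source.

```python
# The unified-model idea in miniature: write the transform logic
# once, swap the source. These generators stand in for Beam's
# PubsubIO / KafkaIO read transforms.

def pipeline(source):
    """The business logic doesn't care where events come from:
    here, drop empties and normalize to upper case."""
    return [e.upper() for e in source if e]

def pubsub_source():       # stand-in for a Pub/Sub subscription
    yield from ["view", "", "purchase"]

def kafka_source():        # stand-in for a Kafka topic
    yield from ["view", "purchase", "view"]

from_pubsub = pipeline(pubsub_source())
from_kafka = pipeline(kafka_source())
```

In actual Beam code, the equivalent move is replacing the `Read` transform at the top of the pipeline while every downstream transform stays untouched.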
So, as of now, we
have covered the first part
of the architecture, which
is our real-time streaming
pipeline, where an event
ends up all happy
and good in BigQuery.
Now we'll move on
and do something
useful from that event.
So this brings in
Dataproc into the picture.
So this is how our
recommendation workflow looks.
Basically, from that
product event, we look
at some of the
interesting attributes--
was it a blue shirt,
which category
did that product
belong to, and things
like that--things that are
required for our recommendation
algorithms.
We do that feature
extraction, prepare
the input for our
recommendation algorithms,
and put it on GCS.
Next, we spawn on-demand
Dataproc clusters,
submit the Spark jobs, and
produce the recommendations,
which are then ready to be
consumed by our rec service.
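The feature-extraction step ahead of the Spark jobs can be sketched as follows. The attribute names (color, category) mirror the examples in the talk; the input and output shapes are hypothetical.

```python
# Hypothetical feature extraction preparing Spark job input.
# Attribute names echo the talk's examples; the format is made up.

def extract_features(event: dict) -> dict:
    """Pull out just the attributes the recommendation
    algorithms need, with defaults for missing values."""
    attrs = event.get("attributes", {})
    return {
        "product_id": event["product_id"],
        "color": attrs.get("color", "unknown"),
        "category": attrs.get("category", "unknown"),
    }

features = extract_features({
    "product_id": "sku-42",
    "attributes": {"color": "blue", "category": "shirts"},
    "user_agent": "not needed downstream",
})
```

In the real workflow, rows like this would be written to GCS as the input files that the on-demand Dataproc clusters read.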
So now, as we can see,
we have completed
our journey, where
a product from the
browser entered BigQuery.
We generated some
recommendations out of it.
It is back on that page.
So we've completed
the whole cycle.
As I was saying about
creating these
on-demand clusters: as
more and more clients came in,
and the data science team
gave us more and more
recommendation algorithms,
we were spawning lots and lots
of on-demand Dataproc clusters.
In fact, we had thousands
of Dataproc clusters
spawning and running in a day.
If you follow that approach
of creating on-demand
clusters, there are two things
I'll mention.
The first, as I said,
is the scale at which
we have used it--
in case you had that
doubt in your mind, what
if we have to create
thousands of these clusters:
we have used it in production
for a long time now,
so it should work for you.
The second important
thing is, if you
do go with that approach, do
have a look at scheduled
cluster deletion.
Because sometimes, if, for some
reason, your jobs are stuck
and these clusters
keep being spawned,
you probably don't want
your bill to go up
and resources to be wasted.
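Scheduled deletion is configured when the cluster is created. As a sketch--the cluster name, region, and durations here are placeholders, and you should check the current gcloud documentation for the exact flags:

```sh
# Illustrative only: names and durations are placeholders.
# --max-idle deletes the cluster after it has been idle that long;
# --max-age caps total lifetime even if a job gets stuck.
gcloud dataproc clusters create my-on-demand-cluster \
    --region=us-central1 \
    --max-idle=30m \
    --max-age=2h
```

With a cap like this, a stuck job can at worst cost a couple of hours of cluster time instead of running up the bill indefinitely.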
Currently, we are
also experimenting
with long-lived clusters
with autoscaling turned on,
so we can use on-demand
clusters and a single cluster
in tandem.
In conclusion,
fast prototyping--
it's really easy to prototype
and validate your ideas
in both of these systems.
So if you're
thinking about it, I
would recommend you just go
for it and see for yourself.
Serverless makes our
life really easy.
Autoscaling gives you
peace of mind: whether
it's peak time
or off-peak time,
it's going to manage the
resources efficiently.
New features-- Google has
been great in launching
new features.
And as they launch new
features, we retire our code.
And we love that.
Who doesn't really love just
removing old code that you
don't need to use anymore?
Last, there are a lot of online
resources that you can look up.
There's a wealth of resources
out there--for example,
"Writing dataflow pipelines with
scalability in mind" by Reuven.
There are a lot of blog
posts actually written
by Sergei as well.
So you should have a look.
All in all, these things
have made our life easy.
We can focus on
creating new things
rather than maintaining
systems and worrying
about days like Black
Friday or Boxing Day.
And we, at Qubit,
can now focus more
on our vision, where
every customer has
a personal experience
with the brands they love.
On that note, I'm going
to hand it over to Sergei.
[APPLAUSE]
SERGEI SOKOLENKO: Thanks, Ravi.
All right, let's wrap up.
I have a couple of
announcements at the end.
We also have time for questions.
We have five minutes
for questions.
Please start lining
up at the microphones.
We would be happy
to answer them--
nope, not yet.
So, Dataflow--hopefully, I
was able to show how Dataflow
continues investing
in streaming,
makes it really easy to
work with batch processing,
is open, scales resources,
and optimizes your
processing time.
Dataproc is an easy,
fast, cost-effective way
to run open source
software in the GCP cloud.
If you would like to learn more
about Dataflow and Dataproc,
come to these sessions on
Wednesday and Thursday.
I would be happy to share more.
I also have an announcement
to make, a small one.
If you would like to learn
more about streaming systems,
you can buy a book
and get it signed
by the members of
the Dataflow team
who wrote the book
on streaming systems.
They'll be here on
Thursday at 1:00 PM.
[MUSIC PLAYING]
