[MUSIC PLAYING]
SUDIR HASBE: Thank you
for coming to the session.
I know it's the afternoon.
I will try to make sure I
don't let you fall asleep.
We're going to have an
exciting set of product demos.
We have a couple of
our customers sharing
their stories, how they're
using the platform,
and the whole goal here is to go ahead and build on the morning keynote. Did you get a chance to see the smart analytics demo?
Was it good?
Great.
So the key thing is we want to take that whole theme of smart analytics, give you more context on the other product announcements we are making and what we are launching, and give you all the details that you need.
But before that, let's
talk about what's
changing in the world, right?
If you think about--
industries across the
board are changing.
If you think about the automotive industry historically-- 10, 15, 20 years back-- it was a very different industry.
Now, with organizations like Cruise Automation,
they're collecting
real-time information
from all of their vehicles,
doing self-driving cars,
making real-time decisions
at massive scale.
And so the ability to take these large amounts of data and make decisions on them is super important in organizations.
The second key thing-- another great example is AirAsia. Democratizing data insights within an organization is becoming more and more critical. AirAsia was here with me at the session at Next last year, sharing their story.
They've saved roughly 5% to
10% on their operational costs.
In an airline, that's
a pretty large number.
And this is possible
only because they
were able to take all
the data and insights
within the organization,
make it available for all
of their users who
are actually doing
different kinds of activities
within the organization.
So it's very critical not
just to have infrastructure
that can scale to do
massive amounts of queries,
but also to go out and
make those insights
available to everybody
within the organization.
And then there is the
other aspect to it.
A lot of different industries,
a lot of different customers,
are like Broad Institute, which
was founded by MIT, Harvard,
and Harvard hospitals.
They're producing roughly 12 terabytes of data per day and leveraging cloud computing infrastructure to sequence a genome every 12 minutes.
And so it's interesting
to see how organizations
across different
industries are actually
using cloud computing,
especially our data analytics
platform, to take
massive amounts of data,
derive insights from them, and
make decisions at every point.
We're seeing momentum across
all different industry
verticals, all different
regions, with the platform.
A few key interesting facts I
wanted to share with you today.
One is BigQuery. We now manage more than an exabyte of data in BigQuery for our customers. Our largest customer now has more than 250 petabytes of data in a single data warehouse. And last year we saw roughly 300% growth in data analyzed on the platform.
So it's fascinating to
see all this growth--
organizations across the
world leveraging the platform,
leveraging the
insights that they
can gather from Google Cloud.
So let me talk a
bit about what's
our philosophy around
our investments
in our analytics platform.
Our main goal-- we have this theme called radical simplicity. Our goal is to make sure deriving insights from data is super simple, to the point where anybody within an organization can do it.
And how do we do that?
How do we make it happen?
One, the most important thing
is investing in serverless.
You should be able to bring any data, put it into Google Cloud, and start analyzing it without having to worry about infrastructure.
The second is providing a comprehensive solution that covers the end-to-end lifecycle of your data management.
Then embedding
ML-- not just using
ML to improve our products
but also making sure
ML is available to everybody
within the organization.
And then we are a firm
believer in open cloud.
We are a firm believer in
making different open source
components available
to you, to run at scale
within our environment.
And finally, all the enterprise capabilities that you expect us to have within the platform are super important.
So, a quick visual. The key thing about a serverless data platform is this: in traditional platforms, you would have to spend time figuring out your capacity requirements, how many servers you're going to need, provisioning, monitoring-- so much work goes into all that.
But our key thing
is, you shouldn't
have to worry about all that.
That's all managed by Google.
We take care of that.
You just bring as much data as you need and start analyzing it from there.
And then from our
platform perspective--
I know there are a
lot of logos here.
I'm not going to go in
depth of every one of them.
But from an end-to-end
lifecycle perspective,
we have services that allow you to ingest data-- that could be real-time streaming at scale with Pub/Sub, collecting millions or billions of events per second. There are services for transferring data from on-premises systems and from different SaaS applications, and a service for ingesting IoT data.
So all the ingestion services
are available to you.
Then for all the real-time
and batch processing,
we have Dataflow, which is our
streaming engine capability.
It allows you to do a batch
and streaming with single API.
With Dataproc, you get managed Hadoop and Spark environments.
Dataprep allows
your data analysts
to go ahead and
do data wrangling.
All of these are
available to you.
And then in data warehousing,
you have BigQuery.
You can use Cloud Storage
for storing massive amounts
of unstructured data.
And then on advanced
analytics side,
you have our Cloud AI services.
So that's a whole portfolio.
You have a whole set of things.
In addition to that, we have Cloud Composer, which runs managed Apache Airflow for you so you can do workflow orchestration.
And then we are announcing two new services-- you heard about them today: Cloud Data Fusion, which we'll talk more about, and Data Catalog.
So that completes the whole
portfolio that we have.
And then, with that,
let me share a few more
things on different scenarios.
When we talk to customers,
there are three main scenarios
that our customers
leverage the platform for.
Thomas mentioned these earlier today, and I will try to take them to the next step and give you more details.
But one is modernizing
data warehouses
so that you can
go out and make it
broader than just
data warehousing
for reporting and dashboarding.
It's more about intelligent
decision making, predictions,
and stuff like that.
So we'll talk more about that.
The second is taking the large-scale Hadoop clusters that customers are running on premises and moving them into the cloud to get much better TCO, but also the scalability that the cloud can provide.
And the third is
streaming analytics.
I think by 2025, more
than 25% of the data
generated will be
in streaming form.
And as industries change, you will need capabilities that can collect this streaming data and make real-time decisions on it within your applications.
So that is super
critical, and that's
what customers are using.
Other than that, we heard
a lot from our customers
about breaking the
data silos, making
it easy to get data
into the platform,
and then also protecting
and governing the data.
So we have those
solutions available.
So let's talk about
the first thing.
Earlier today, you saw a demo of Data Fusion. Cloud Data Fusion, basically, is our fully-managed, code-free data integration service.
The whole idea is,
we want to make sure
that bringing data to GCP is
super easy for our customers.
Data Fusion is actually based on an open-source project called CDAP, and it gives you a visual tool where you can drag and drop, picking from the hundreds of connectors we have already built for on-prem systems, different applications, and more.
And then you can go ahead
and transform the data that's
coming in, and
you can publish it
into any one of the data
stores that we have--
could be BigQuery,
could be Cloud SQL,
could be any one of the other
data stores that's available.
The key aspect to this is--
the goal here is just
simplifying migration
of your data to
Cloud, transforming it
as it's coming in, and making
sure you have a single place
to manage all your
data pipelines.
And then finally, it provides the ability to do visual transformations. As data comes in, you can track its lineage and apply data quality on top of it.
So this is one of our big
releases for this Next.
It's available in Beta, so you
can go ahead and leverage it.
You saw some of the
demos earlier today.
There are two other things that we have in the same realm. If you have used BigQuery, it has connectors for our first-party services like AdWords, DoubleClick, and so on-- these make it easy for customers to bring data in from those applications, put it into BigQuery, and analyze it.
We have extended that to
our partner ecosystem.
And so now, I'm happy to announce we have more than, I think, 135 connectors across different applications-- Salesforce, Marketo, Adobe Analytics, Facebook Analytics, Workday, all the different SaaS applications. It's now available, and you can start using it in your BigQuery environments.
And the third thing is, we know there's a big challenge in migrating traditional data warehouses that are running on premises-- say Teradata, or if you're using Redshift. We now have tooling available to easily migrate those to BigQuery.
So that's the key
thing, that we're
providing all this tooling
to make it easy to bring data
into GCP so that you
can start leveraging
the other capabilities
that we already have.
KEITH FERGUSON:
Customers want us
to be able to help
them understand
their business better.
They don't just want
us to do their banking.
Our employees' expectations
are changing as well.
They'd like us to provide them
with relevant data and insights
so that they can make smart
decisions in a timely manner.
And so to do that, we need to
look at digital transformation,
and a key part of that digital
transformation is data.
SUDIR HASBE: Got it.
So can you share some use cases
with BigQuery or other things
that you're using so that we
can get more insights into what
you're doing?
KEITH FERGUSON:
Yeah, absolutely.
Ivan, do you want to talk about
some of the BigQuery uses?
IVAN LIU: Yeah, of course.
So BigQuery has been one of the key tools that our data scientists are using on a daily basis, and it effectively helps us a lot in terms of scalability and handling heavy computational queries on top of different data sets.
So I will give you a real story from our team. Some of our data scientists are using customer transaction data to build aggregated, de-identified insights for our institutional-level clients, to help them understand their customers better. Those analyses include: what do your loyal customers look like, where do they live, and who is lapsing from your business?
We are analyzing billions of transactions. Back then, it was around 17 terabytes in a single table, and it took literally five days to extract the data and get insights from the data set. That made it quite costly to deliver insights to our clients, and it also limited our data scientists' ability to continuously develop new insights and add new innovations on top of the data.
And by moving that whole pipeline to BigQuery, we successfully reduced the time from five days to 20 seconds, which is a big achievement for us. It not only enhanced the efficiency of our data scientists but also allowed us to start rethinking the data science process in the organization. Our data scientists now meet our clients directly rather than just optimizing queries sitting at the backend.
They bring their insights to the clients and take direct feedback from them-- we've even conducted customer-led design workshops together to gather customized insight requirements so we can support our clients better.
The reason we can take on customized insights is that we have the confidence that BigQuery can handle the heavy computation at the backend.
We've worked with airlines to help them analyze their customers' shopping behavior before and after flying, so they can use those insights to optimize their campaign effectiveness.
Also, we've been working with a few retail companies in Australia to help them identify the best location to open a new store.
Such customized analysis helps us position ourselves not only as a service provider but also as a strategic partner for our clients from a data and analytics perspective.
And currently, we have streams of data scientists working on bringing more data-- like payments, supply chain, and credit-rating data-- onto GCP, combining and commingling those different data sets to unlock the value of the data in the bank.
SUDIR HASBE: I think that's awesome. I just heard five days, roughly, dropped to a few seconds. That's the power of what you can do with data processing and analytics at scale. And it's super interesting to hear.
Can you share more about
Cloud Composer usage,
how you're using Composer
for orchestration and all?
IVAN LIU: Yeah, of course. So our team has been exploring different orchestration tools, and we've been using Composer since it was in alpha. It's a great tool for teams, and our data scientists love it because it's Python-based and it makes it very easy to manage dependencies across multiple layers of data pipelines.
We've currently got daily and weekly data pipelines running on Composer to generate hundreds of features and multiple terabytes of data.
However, with the growth of the teams and the complexity of the data pipelines, we're starting to hit challenges, like running multi-tenancy on Composer. So we've been very excited to hear more announcements about Composer this time around.
SUDIR HASBE: That's great.
We'll share some today.
So Keith, can you share more on how you made the decision to go to GCP, and what everybody here-- especially those in industries like yours-- should think about as they move to the cloud and make that decision?
KEITH FERGUSON: Absolutely.
So I think, when we're looking
at moving to a cloud provider,
one of our key
requirements was we
needed a provider to help us
get the most out of our data.
And so the core data
capability is very important.
Services on top of data,
AI services, ML services,
and partnering with someone
who has those services-- also
absolutely critical.
But also, data doesn't
live in a vacuum.
And where the data is, is where application delivery starts to converge.
And so when looking
at GCP and Google,
we found a provider that has
those AI and ML services--
also has the application
delivery components.
And so we're also
very heavy users
of GKE, Cloud SQL, and a number
of other components as well as,
then, the underlying
data capability.
And so I think, as an
ecosystem, that's great.
As a financial
services organization,
there are a whole other
suite of considerations
that need to be overlaid
on an implementation.
And so as a heavily
regulated industry,
it's very important that,
when implementing a cloud
environment, it's not only
the awesome technology that's
there, it's balanced
with great controls that
can meet the expectations
of your regulators, that
can ensure that you hit those
privacy expectations, again,
of regulators but also
of your customers.
And then, as a final point-- an interesting piece of the cloud implementation journey is that once you're there, things become a lot faster in terms of your ability to deliver. But that can shine a bit of a spotlight on yourself internally as an organization-- on your processes and your ability to actually internalize and deliver on that change.
SUDIR HASBE: Good.
Thank you.
Thanks a lot for sharing.
KEITH FERGUSON:
Thank you very much.
SUDIR HASBE: Thanks, Keith.
Thanks, Ivan.
[APPLAUSE]
So it's very interesting, as we've seen in the last couple of years, how different industries have started to adopt the cloud and to use some of our analytics capabilities for different scenarios.
With that, let me share
a few things around
what's coming new with BigQuery
in this conference, what
are the different things
we are announcing.
One is, we had a goal last year
to go out and launch BigQuery
everywhere.
We have been steadily increasing
our footprint globally.
This is super critical
as organizations
want to keep their data
in specific geographical
locations.
So we've launched around 12 regions in the last year, and we'll continue the momentum going forward to make sure we are available in every region where Google data centers exist. The work is not done-- we will keep at it-- but we are already in all of these different regions, so there should be a region near you now.
The second big thing that
we are announcing today
is, basically, BigQuery supports
two different pricing models.
It has an on-demand model, where you pay per query for whatever data you're accessing. The second is a flat-rate model, which gives you price predictability: you can buy x number of slots for the whole month and then use them.
What we are announcing
today is two things.
In Alpha, we will have our
reservations API, which
will give you two capabilities.
One, if you're registered for the Alpha, you will be able to go online and start buying slots directly-- you can go out and say, I want 2,000 slots. But we are also lowering the entry point: we are making a 500-slot BigQuery flat-rate plan available, which reduces the cost of entry if you want to get started at a lower level.
So that's one.
The second thing it allows you to do is quickly and easily manage resources. Let's say you have 2,000 slots and you want to distribute them across four different teams-- say everybody gets 500 each-- so that different kinds of queries can have different priorities and so on. You can do that, but we always make sure that any unused slots remain available to everybody.
So the key thing is, the
compute resources that you
have is always available.
You can use them, but
you can allocate them
across your organization
very easily.
This has been one of the asks from our customers for quite some time now.
So this is going
to be available.
The second thing that
is available-- earlier,
a couple of months back,
we announced Storage API.
So there are a lot
of organizations
who are putting all of
their data in BigQuery.
BigQuery Storage is their
structured storage layer
for all of their data
in the organization.
And SQL is a great language
for a lot of things,
but not for all things.
And so we have a
lot of customers
who wanted to use Spark or
Hadoop on top of the same data
that we have.
Why would you want the same data copied into GCS and BigQuery and all the different storage layers just so you can process it?
So we basically have a
high-speed Storage API
available.
With this, the data that is stored in BigQuery is now available to any of your Spark or Hadoop workloads. You can use Dataflow for batch jobs against BigQuery if you want to, or use ML Engine and the ODBC drivers. All of them can directly leverage the same storage layer at high speed, and you'll be able to run all these different types of workloads on the data in BigQuery.
So this just expands
what you can do.
The third thing that we have coming in BigQuery-- at Next last year, I think in July, we announced the Beta of BigQuery ML. BigQuery ML will be going GA in a few weeks from now.
Along with that, based on the demand that we are getting, we now have k-means clustering available. So if you want to do segmentation-- customer segmentation, those kinds of scenarios-- you'll be able to do that very easily with just a couple of lines of SQL code. You can do matrix factorization, so you can build recommender systems. And you can import TensorFlow models directly into BQML.
So those are the
three key things.
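To make that concrete, here is a minimal sketch of what customer segmentation looks like with k-means in BigQuery ML-- the dataset, table, and column names are hypothetical, not a product sample:

```sql
-- Train a k-means model for customer segmentation (all names are illustrative).
CREATE OR REPLACE MODEL mydataset.customer_segments
OPTIONS (model_type = 'kmeans', num_clusters = 5) AS
SELECT * EXCEPT (customer_id)  -- cluster on the feature columns only
FROM mydataset.customer_features;

-- Assign each customer to a segment; customer_id passes through to the output.
SELECT customer_id, centroid_id
FROM ML.PREDICT(MODEL mydataset.customer_segments,
                TABLE mydataset.customer_features);
```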
The fourth key
announcement that we have,
and we announced this
earlier in the keynote,
is around BI Engine.
So the whole idea of BI Engine--
it's a fast, low-latency
analysis service.
You don't have to create any
kind of models or anything.
You automatically-- the
data that's in BigQuery,
it can accelerate
queries on top of it.
Our goal is to have response times in milliseconds-- under a second in most cases.
And it will be available so that you can do interactive reporting and dashboarding very easily across your organization, at large concurrency numbers.
So that's another
thing that's available.
So I've talked about a lot of things here. Earlier, I also mentioned all the 100+ SaaS application connectors, and we talked about BI Engine. So let's do a quick demo.
Let me call upon Michael
to come and show us some
of the capabilities
that we're launching.
Michael.
[APPLAUSE]
SPEAKER 1: So I'm
going to show you
guys what it's actually like
inside BigQuery, to go get data
from an external
source and bring that
in through a transfer.
Going to try and go through
pretty quickly, here.
Let's click the Transfer
button in BigQuery,
and that'll take
us to a view where
I can see the active
transfers that I have now.
So I'll hit the Create
Transfer button here,
and we have Google's built-in transfers down here. I can transfer from Google Play, Google Ads, sources like that.
But now I can click
Explore Data Sources there,
and just like Sudir said,
here we have a long list.
We have more than 100
external data sources
built by our providers
that show up in this list.
So for example, here is an
Adobe Analytics connector--
highly-requested source from us.
This is made by Supermetrics,
one of our close partners.
And we can see details
about this connector.
I can also enroll in
it, or I can search
for others on the marketplace.
Another example--
Facebook connectors.
So we have some Facebook Ads
data here that you can see.
I can search for
Salesforce as well.
And when I search
there, top result
here is a Salesforce
connector built
by Fivetran, another one of
our really great partners.
And I can enroll
in this connector
right on the marketplace
and choose the project
that I want to enroll in.
I've already enrolled
for this project,
and so, because I did that, it's
going to show up for me now,
automatically, on my
drop-down list right there.
So I hit Salesforce by
Fivetran, and then I
can enter the name of the
connector of the transfer
that I want.
And I can choose the
schedule that's right for me.
We can go weekly, or in
this case, a daily schedule.
And I'll select to the
destination data set
inside BigQuery that
I want that to go to
and then hit Connect Source.
And right here, I get a warning.
This is asking me permission
for that connector
to write data into the BigQuery
data set that I selected.
So I'll hit Accept, and
then up comes this pop-up
from Fivetran where I can
authorize the connector
that I'm interested in.
So normally, this would ask
me for my Salesforce password.
I've already done that,
though, so when I click Save,
this is going to create
the connection for me
and take me back to
the Transfers page
where I can complete my
settings for the transfer.
I can also choose to get
notifications if I want to,
in case the transfer fails.
So let's click Save.
And that's going to configure
the transfer for me.
And there, you can see it.
The transfer run is now pending.
So that's really all
that I needed to do.
That's how easy it is to go all the way from choosing one of 100+ sources to setting up a transfer that lands right in your BigQuery data set.
That's really all
you need to do.
This particular transfer
takes about seven minutes.
So I have a data
set already set up
where you can see what
it looks like once
we're actually inside BigQuery.
And let me just show
you what it looks like.
Here's that Salesforce data.
We can preview the leads table.
There's city data, data on
the company, in this case,
for each of these rows.
We can go and query that,
join with other data,
and integrate it with other
information inside BigQuery.
But now I'm going to
show you the BI Engine
feature with this data.
So like Sudir mentioned,
with this BI Engine,
we have the capability to run
really fast sub-second latency
queries.
That's because it's running from
memory, from RAM, inside GCP.
So I can go ahead,
create a reservation,
and decide on the capacity
that I want with BI Engine.
And then once I've done
that, what can I do with it?
Well here's a great example.
This is a Data Studio dashboard.
It's running off
BigQuery on BI Engine.
And as I'm clicking
around here-- let's
see, filtering down to nurturing
and new leads, for example.
Maybe I want to slice and dice
this by Houston and Dallas.
And it's reacting really fast
because it's using BI Engine.
So please try out
BI Engine today.
It's in Beta.
And check out the external data
sources on the marketplace.
Thank you.
[APPLAUSE]
SUDIR HASBE: Thanks, Michael.
I think there's key value in connecting all of these different types of applications-- organizations are using many different applications now. Bringing all that data together, running analytics across all of it, and deriving insights is going to be very interesting for organizations, I think.
And making it easy is
one of our key goals.
Other than that, there are
a lot more other things
that we are also working on.
We are announcing more-- I won't go in depth on each one, but here's some additional information. Later this year, we will have the ability to do federated queries on top of Parquet and ORC files directly on GCS. You'll also be able to federate across Cloud SQL, which is going to be another data source for queries from within BigQuery. So that's there.
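As a rough sketch of the shape this takes-- the connection ID, tables, and columns below are hypothetical, and the feature itself is still to come-- a Cloud SQL federated query from BigQuery would look something like:

```sql
-- Join BigQuery data with a live Cloud SQL table via a federated connection
-- (connection ID, table, and column names are hypothetical).
SELECT o.order_id, o.amount, c.segment
FROM mydataset.orders AS o
JOIN EXTERNAL_QUERY(
       'us.my_cloudsql_connection',
       'SELECT customer_id, segment FROM customers') AS c
  ON o.customer_id = c.customer_id;
```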
Other than that, there is a good economic advantage report created by the ESG Group. You should take a look at it if you're moving to the cloud, especially with BigQuery. There are massive savings you can get from a total cost of ownership perspective.
Let's switch gears-- talk
about the second key scenario,
running large-scale Hadoop
and Spark workloads on GCP.
One of the key value propositions we have is that we let you pick any of the open-source projects you want and run them through Dataproc and Composer, the technologies that we have. For example, we've been continuously adding more and more projects. Now you can leverage Presto, and you could already run Hadoop, Spark, and various other projects underneath it.
It's secure, and we do the management of it-- we can launch the clusters and shut them down automatically.
If you really look at the
value proposition of this--
I won't go into depth of
each one of these points.
The key thing is, if you compare being on prem, or managing it all yourself on Compute Engine, versus using a managed service, the green is what you have to do and the blue is what we take care of.
So the key thing is-- just focus
on the last column and see.
You just have to manage your
code, write the code, deploy,
and we take care of the cluster
management and everything.
That's the biggest value
proposition for the Cloud.
Especially with [INAUDIBLE] clusters, you can get massive savings-- you don't have to keep static clusters running at scale throughout the day.
With that, let me call
upon Jonathan and Rares
from Booking.com to share
more of what they are doing.
[APPLAUSE]
[MUSIC PLAYING]
Rares.
RARES MIRICA: Hi.
SUDIR HASBE: And Jonathan.
So why don't you
introduce yourself.
Tell us more about Booking.com.
RARES MIRICA: Sure.
I'm Rares Mirica, a principal developer at Booking.com. I work on enabling cloud technology for Booking.com and opening that up to Booking.
Booking is the largest online
travel agent in the world.
We employ over 17,000
people, and we have
offices in over 120 countries.
So we, of course, work with
a lot of data, as well.
I'm joined by Jonathan.
JONATHAN POELHUIS: I'm John.
I'm a data scientist working
in data quality in Booking,
so I care quite a lot
about all the products
we're putting together here.
SUDIR HASBE: Got it.
So what were the key challenges,
Rares, in Booking.com
before you started
migration to Cloud?
RARES MIRICA: Sure.
So at Booking, we run quite a large installation of Hadoop. The workloads on it are mostly Hive and Spark, both production and human-interactive. We have over 1,000 daily users on these clusters. And of course, because they all like to work together, there is a lot of contention for resources on these clusters. So that was a huge challenge for us.
And that was an
opportunity for us
to use Cloud and give the
data scientists-- especially
for the more data intensive
workloads-- give them
personalized capacity--
so basically,
dedicated clusters per user.
And that was our
proof of concept work
that we started late 2017.
That was very successful
with our data scientists.
And that was the
business case for later
on triggering a big
data migration to Cloud.
SUDIR HASBE: So that means
every data scientist can
have their own cluster
that can spin up,
and then they can work on that?
Is that--
RARES MIRICA: Yes,
that is correct.
SUDIR HASBE: --your
thinking about it?
RARES MIRICA: The default is a large multi-tenant cluster where they contend for resources. But they have the option to elevate that to a dedicated cluster for themselves, where the data scientist decides, up to a certain limit, the size of the capacity they need.
SUDIR HASBE: Got it.
That's interesting, because that's one of the benefits of moving these things to the cloud: having scenarios where you can have static clusters but also burst out for specific workloads.
That's--
RARES MIRICA: Yeah.
SUDIR HASBE: --really good.
Can you share about--
I know our teams
have worked together
on interesting
challenges you had
and how we have
incorporated some of them
in the product portfolio.
So can you talk about--
RARES MIRICA: Yes.
So aside from the challenges of moving the data to the cloud and then integrating it-- making these clusters come up with the data on them and making them available to the data scientists-- the first thing the data scientists asked for was for their toolbox to be the same as on prem: the same libraries, the same integrations with on-prem technologies, and so on.
We found out pretty quickly that the time to spin up such clusters went up quite a bit, and that was impacting the user experience. So our ask to your team was to make it possible for us to create customized images for Dataproc, which, in collaboration with Google, we've now managed to get to GA.
SUDIR HASBE: Yeah.
RARES MIRICA: So that
is what we are using.
SUDIR HASBE: You know, that's great. We always learn from our customers-- it was a great scenario, and we were able to put that in quickly so that everybody can benefit.
So Jonathan, why
don't you share more
about what you have been
up to with the whole set
of technologies?
JONATHAN POELHUIS: I'd love to.
So working with our
Google counterparts,
we started an exploration
saying, well, now
that we are in Cloud,
there are some tools
that are available to us
like BigQuery and BigQuery ML
that we don't have on premise.
So let's see if we can use this
for a case close to my heart.
Can we surface some
data quality issues
that would be very
difficult to do otherwise?
So the scenario we
chose to explore
is very Booking in nature.
So we of course serve
many properties.
On the website,
you can find them.
Each property has
many room types.
A room type might represent many rooms, but each room type is quite particular. We're at a scale of millions of properties, and so tens of millions of room types.
And maybe most
particular for a visitor
is that those room types have
lots and lots of facilities
or potential facilities--
something like 176 of these.
So the scenario would be, a
customer visits the website.
Maybe they have a
particular facility in mind
that they'd really like
to make sure is there--
a bathroom, a TV, who knows?
They can filter for this.
Well, that's really
helpful from our side.
However, we then need to make
sure that that data is correct.
How could this go wrong?
Well, if you are
forgetting to list this,
as a property
manager or owner, you
can lose customers
by way of the filter.
And go the other way--
if you accidentally
say that you have
it, well then you
might be misrepresenting
yourself accidentally.
And then the customer experience
is quite odd when they arrive
and the bathroom isn't there--
say, OK, that's not so good.
I don't know about
your trip, but--
SUDIR HASBE: [LAUGHS]
JONATHAN POELHUIS: So how do
we fix these potential things?
They also might tell us
something about ourselves.
Maybe we can ask
better questions
of the property owners.
We can learn things about how we're presenting the form, so that certain repeated mistakes can be eliminated.
So this is certainly an
added value for Booking
if we can get this right.
We're the intermediary.
OK.
So again, I mentioned the data.
It's very wide.
It's reasonably long,
tens of millions of rows,
and quite wide.
It's very Boolean, so it's yes
or no to having a facility.
But 176 of these--
it's quite a lot.
So we wanted to attack this
using something in BigQuery ML.
In particular, we're
going to end up
using k-means clustering.
Now, why do we want to
try that perspective?
Well, we could
attack it with rules.
We could say, ah,
we know for sure
that if you have
pay-per-view channels,
you'd better have a TV
to watch them on, right?
That's pretty reasonable.
However, 176 lends
itself to lots and lots
of subsets of rules.
It's very difficult to manage
and upkeep because, well, you
could be certainly adding lots
more facilities in the future.
So maybe all kinds of
HoloLens or something
like this is available
in your room.
So it would be very
difficult to manage by hand.
Let's see if we can surface these by throwing math at the problem, especially in a quick and iterable way via BigQuery ML and SQL.
So we have one premise,
one assumption,
backing our project here, which
is, we assume most of the data
is pretty healthy.
It's pretty
representative of truth.
If that's true, then we hope
that similar things will
end up next to each other
in such a clustering
and oddities will stick out.
Odd things, we'll
be able to find.
So with this assumption--
I think it's a pretty good one.
We know our data
relatively well,
and it's not our first time
looking at this kind of thing.
So we're hoping math will find
the ones we haven't caught yet.
OK.
So k-means clustering--
there was a very nice talk
earlier today.
We gave a longer
version of this.
I hope you'll visit
it on YouTube.
But if we throw k-means
clustering at this,
we have just a few
lines of code to build
several different
versions of the model.
We only have to tune,
maybe, one parameter,
and we can very quickly
see what comes out.
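As an illustration of how little SQL this takes-- all names here are hypothetical rather than Booking's actual schema, and it assumes the k-means ML.PREDICT output exposes per-centroid distances via NEAREST_CENTROIDS_DISTANCE:

```sql
-- Cluster room types on their Boolean facility columns (names hypothetical).
CREATE OR REPLACE MODEL mydataset.room_clusters
OPTIONS (model_type = 'kmeans', num_clusters = 10) AS
SELECT * EXCEPT (room_type_id)
FROM mydataset.room_facilities;

-- Surface the room types farthest from their assigned centroid
-- as outlier candidates.
SELECT
  room_type_id,
  centroid_id,
  (SELECT MIN(d.distance)
   FROM UNNEST(nearest_centroids_distance) AS d) AS distance_to_centroid
FROM ML.PREDICT(MODEL mydataset.room_clusters,
                TABLE mydataset.room_facilities)
ORDER BY distance_to_centroid DESC
LIMIT 100;
```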
Let's visualize some of those.
Well, actually-- sorry.
Let me take a step back.
What would clustering look like?
This has two dimensions.
Of course, we have 176.
Remember, our goal is to find
the things in the triangles,
the ones that really stand out.
If we can cluster well, the ones that are very far from their centroid-- their middle-- are maybe the ones where something odd is going on.
So this is visually
what we're aiming for.
OK, here it is in Data Studio-- very convenient that we can look at this through the GCP pipeline. We have a lot going on.
But if you look at
the red on the right,
this is relating both
the size of clusters
and also how far they
are from each other.
OK?
Well, notice cluster 10,
there, is actually quite far
from any other cluster.
Well, that tells us something.
Maybe it's odd.
That's one of the ways
a cluster could be odd.
It could be very
far from others.
It turns out cluster 10 is
very weird for another reason.
If you look at the
green there, this
is the distribution of
distances within the cluster.
So for each cluster,
if you are at zero,
that means you're
at the centroid.
And if you're at
the top, that means
you're as far away from
the centroid as anything
in that cluster.
Now I care about
cluster 10, again,
because, well, the
farthest points
are the farthest of any
from their centroid.
So we probably should
look there first.
It gives us an indication
of where we could peek.
So let's take a look at a few
examples from, OK, cluster 10.
If we drill in,
what do we see here?
Well, let's look at the
things that this item has.
I've drilled in.
I'm looking at exactly one item.
The blue bars there
represent something
about the whole cluster.
Cluster 10-- how common is each
one of those room facilities
which are on the bottom.
So let's take an
example of say, toilet.
The red bar says that's
available in my outlier.
It's also very
commonly available
in the cluster in total.
I see a toilet.
I see shower.
I see body soap.
I see free toiletries.
These are all very good things.
Where would you put them?
Probably in a bathroom,
which is not available here.
The outlier does not have
a bathroom listed, anyway.
Maybe it does have one.
I don't know-- and
free toilet paper.
So maybe it's BYO-toilet
paper, I'm not sure.
Maybe you filtered
for that yourself.
Maybe that's something you want.
But from my trip,
I'd like to be sure
that there's toilet paper
waiting for me when I arrive.
That's good to know.
We should at least follow
up with the property.
Let's take one more example,
go a different direction.
Of course, this one might
have other bathroom problems,
but let's see what it does have.
So I see cable TV channels.
I see satellite channels.
But I don't see TV
or flat screen TV.
So I hinted at this before.
You have plenty of
channels to watch,
but nothing to watch them on.
We would definitely want to
surface this for the property
because they might
be missing out.
Anyone filtering for
that thing is maybe
going to miss this property,
but it's probably likely
that they have such a thing.
So we found some really
interesting stuff.
I think these are things we
wouldn't have found otherwise
or would have had a very
hard time to identify.
Next step for us would be
automation improvements
so that we could do this
on a very regular basis
and not have to explore by
hand the same way we did.
SUDIR HASBE: Got it.
Thank you, Jonathan.
Thanks a lot.
Some interesting
use cases there.
Thank you.
RARES MIRICA: Yep.
Thank you.
SUDIR HASBE: Yeah.
[APPLAUSE]
So-- yeah.
So let's continue on some
of the new investments
that we are
announcing right now.
We are investing in the security features of Dataproc, so Kerberos is now available.
So you'll be able to use
the same security models
that you're using on prem.
You have auto-scaling capabilities. And one of the other big investments we're making with Composer is Composer Flex, which will give you a completely serverless Composer capability.
And then one of the
other things is,
we just announced our
partnership with Qubole.
A lot of enterprise
organizations
are using Qubole for their
Hadoop and Spark workloads.
And their whole unified
experience with the workbench,
with notebooks,
and dashboards is
super valuable for enterprises.
Now they're available on GCP,
so you will be able to use them.
They have great enterprise
security features, controls,
and governance, as well as
seamless workload migration.
So if you're using
Qubole today, you'll
be able to continue using
that on GCP from your on prem.
The next key thing-- as I said, by 2025, 25% of data will be generated in streaming form. And we have great capabilities on the platform for streaming, from ingestion with Pub/Sub through transformation and analysis across the board. And you can also do that with the open-source technologies and partners that are available.
So with that, one of the
key announcements that we
have today is Dataflow SQL.
So the whole idea is, a lot of organizations use Dataflow from different platforms. You can write Java code with Beam and do all of that. But SQL is a good interface-- a lot of customers like it-- so we're making that available.
Let's do a quick demo
from Sergei on that.
Welcome, Sergei.
[MUSIC PLAYING]
[APPLAUSE]
SERGEI SOKOLENKO: Thanks, Sudir.
So in this demo, I'm going
to take a Pub/Sub topic.
I'm going to associate a
schema with this topic,
and I'm going to join
it with a BigQuery table
to do some stream enrichment.
Once I have a topic with a schema and an enriched stream, I'm going to group it by time and insert the results into a BigQuery data warehouse.
Now, the goal of this demo is to show how quickly you can calculate statistics on a stream of events.
I'm going to start
with BigQuery.
Many BigQuery users will
find it quite useful
that they can now
access Dataflow right
from Query settings.
You get the choice of
the Dataflow Engine
as the execution backend.
And once you choose Dataflow
Engine and save the setting,
you will be able to
create Dataflow jobs.
I actually have a
SQL statement saved
in my notepad just
for demo purposes
so that I can avoid typing it.
I'll quickly explain
what's going on here.
I have a Pub/Sub topic. This is not a table; this is actually a stream of events.
I'm going to join it with
a static table in BigQuery,
allowing me to do a
stream enrichment.
Here's my join condition.
And the key portion
of this SQL statement,
this streaming SQL statement,
is the tumbling function.
This is the piece that allows
you to do streaming analytics.
It creates fixed,
five-second windows,
and you can run aggregations
on top of these windows.
That's exactly what's going
to happen in my projection
part of the SQL statement.
I'm going to create statistics
for sales regions for all
of the sales events
in my stream,
and I'm going to
have a timestamp
of these calculations,
and the sales amount.
Oh, and by the way, I mentioned that we use schemas. We now store the schema for Pub/Sub in the catalog. Here's the schema of my Pub/Sub topic.
This is what enables me
to run SQL on streams,
having a schema.
My events have very
simple attributes.
We have a timestamp,
and we have a payload.
And the payload
contains the person
who purchased the good, the good
itself, where it was purchased,
the state of purchase,
and the amount.
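The statement on screen is along these lines-- a sketch with illustrative project, topic, and table names:

```sql
-- Join a Pub/Sub stream with a static BigQuery table, then aggregate over
-- fixed five-second tumbling windows (project/topic/table names illustrative).
SELECT
  sr.sales_region,
  TUMBLE_START("INTERVAL 5 SECOND") AS period_start,
  SUM(tr.payload.amount) AS amount
FROM pubsub.topic.`my-project`.transactions AS tr
INNER JOIN bigquery.table.`my-project`.dataset1.us_state_salesregions AS sr
  ON tr.payload.state = sr.state_code
GROUP BY
  sr.sales_region,
  TUMBLE(tr.event_timestamp, "INTERVAL 5 SECOND");
```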
Well, great.
So let's run the job.
In my next screen, I just need
to type in the destination
table.
Initially, we support
BigQuery as the destination,
but we'll add more destinations
in the future as well.
Great.
So within a second or two, I'm
going to get a Dataflow Job ID.
That's the Job ID.
And if I click on it, I'm going
to get rerouted to Dataflow.
And once the resources
have been provisioned
and the SQL query gets
executed and transformed
into a Dataflow graph,
I will see things
in the middle of the screen.
Now I don't want to wait for
the purposes of the demo,
so I launched a job just
like it a few minutes ago.
Here's what you will see
in the Dataflow experience.
For those of you who are
familiar with execution plans,
that's exactly
what's going on here.
So Dataflow took the SQL statement and created an execution plan for it using the Beam framework.
I have my input
flow from BigQuery.
I have my input
flow from Pub/Sub,
and I have a join
condition in the middle.
And for this particular
SQL statement,
we are using very
efficient side input joins.
Now I also wanted--
to conclude the
demo, I also wanted
to show you that data
is actually flowing
through this SQL statement.
So I switched back to BigQuery, and I have a SELECT statement here.
Let me quickly run it.
Here are the results.
Let me rerun it again.
All right.
As you can see, I get my data
updated every five seconds.
SUDIR HASBE: Awesome.
SERGEI SOKOLENKO: Thanks, Sudir.
SUDIR HASBE: Thank you, Sergei.
[APPLAUSE]
As I said when I started earlier today, talking about our one philosophy: we want to make it really simple to do the activities that you do today.
You could have done the
exact same thing in Java--
written a few lines of
code and made it happen.
But we want to make it really easy for all the analysts to do the same kind of activity on streaming data.
And with this SQL-based
language now, Dataflow
becomes accessible to everyone
in the organization that
can write SQL.
In addition to that, we also have FlexRS-- Flexible Resource Scheduling. For delay-tolerant data pipelines, you can save up to 40% by using preemptible VMs on our side.
The main thing there is, we guarantee the jobs finish in any case, because we mix regular VMs and preemptibles-- so you get a lot of savings, but we guarantee that your jobs are going to get done.
There are a lot of
other announcements.
I won't go in depth on each one of them, but you can take a look.
From a governance perspective, we already offer a number of things. There's built-in encryption. Customer-managed encryption keys-- we talked about those earlier today. Access Transparency-- Thomas touched upon it. We also have tools for efficient governance, as well as compliance with HIPAA and all the different compliance regimes.
The key thing we are announcing
today is Data Catalog.
Data Catalog is basically a fully-managed, scalable metadata service that will allow you to search for all your data assets-- where they are, what they are, who has access-- all the different details along with that.
It also allows you to go ahead
and define schemas for Pub/Sub,
which is streaming data.
So once you have
that, you can go in
and start using it in SQL.
It's a very simple search experience. It gives you the ability to do auto-tagging with DLP: we can run DLP, mark PII data, and so on. And you can also add your own business metadata that you define.
And then from there, we will be able to define policies on top of it. So you can say, for anything that's PII, do not give access to this group of people, or give access only to these people. You will be able to do those kinds of activities.
But fundamentally,
it's easy discovery
of your data within
your organization.
We're going to solve that
across all of different assets
within GCP.
Other than that, the most important thing is, you have a lot of investment in the different tools you may have acquired across your organization. We have a huge partner ecosystem, and those tools should just run as is, without any problems.
We have great partners in the BI space, like Tableau and Looker-- all those tools are there-- and for ingestion, Informatica, Fivetran (which we showed earlier today), as well as Talend.
So we have a huge partner
ecosystem for you to leverage.
Other than that, Google has been identified as a leader in both the Forrester Wave as well as by Gartner. So we are trending in the right direction. We have a lot of investment going into the analytics platform generally, and it's ready for enterprise adoption.
We have a lot of customers, and it would be great if you go take a look at some of the key new capabilities, start playing with the products, and give us more feedback.
Thank you.
[MUSIC PLAYING]
