[MUSIC PLAYING]
SHEKHAR BAPAT: Thank
you for being here
for the DA201 Data Discovery
in Google Cloud session.
I am Shekhar Bapat.
I'm the product manager
for Data Catalog.
As you might be aware,
we announced Data Catalog
beta yesterday.
I'll be talking
about Data Catalog,
and putting it in the
context of something
that is on everyone's mind--
that is, data governance.
And with me on stage today, I have Renata, sitting here, and Dimas, both from GOJEK, who've been collaborating with us for the good part of the last year
and have developed
applications using
the back end of Data Catalog.
And they'll be doing
a demo of that.
So this is the agenda.
We are going to do an overview,
talk about the bigger trends.
We'll talk about data catalog,
the overview, the architecture,
the various features
that we want
to highlight in the context
of overall data governance.
And then I'll do a little demo.
And then I will hand it over
to Renata and Dimas, who
will talk about
their application
that they have built for GOJEK.
And they're also going
to do a short demo.
We hope to fit all this
into the allocated time
and also leave some
time for questions.
So here's the overview
and trends, right?
So what is driving data
governance initiatives?
A lot of you are very interested
in how to manage your data, how
to discover your data.
Because we are in a world where
there is no shortage of data,
right?
Data is everywhere.
We are drowning in data.
The question is, how do we
make good use of the data?
How do we find the relevant
data, the right data
to drive your decisions?
And in the context of things
like GDPR, California Act,
HIPAA, et cetera, how do we
actually govern the data?
How do we make sure that
the right people have access
to the right data?
And then you can generate
reports, and be compliant,
and not get in trouble, right?
So there's the risk
management part of it
that needs to be managed.
There are the operational
efficiencies and other aspects
that you need to tackle.
And essentially, if
you do not manage it,
you don't have any
control over your data.
And data governance is driven
by these bigger requirements.
So in that context,
Data Catalog gives you
a framework, a foundation for
building your data governance
services.
And so we have a pretty
aggressive roadmap.
But we start with
data discovery, right?
Because without proper data
discovery, you have nothing,
and the organizations
need an effective solution
for data discovery, right?
And the challenge
of data discovery
is that your data is
not all in one place.
It's in disparate places.
It's in different systems.
It's spread all over.
So how an analyst--
let's take the canonical
case of an analyst who
recently joined the
organization and needs
to figure out what data--
what data they have
access to, and how
they can get access to it, and
how do they make sense of it.
And there is a need for
the data discovery tool.
So what we have--
this is the context
of why it is needed.
So we have built a system that
provides a unified view of all
your data assets.
And we'll start, initially,
with the data assets in GCP.
Let's say you have
data assets in GCP
in different sets of, like,
BigQuery, Pub/Sub, GCS,
and going forward, lots
of different systems.
We give you a way to discover
your data that is spread
across all these systems.
And it is a smart
catalog in the sense
that you have to
do minimal work.
So before I go into that, a
little bit of history right--
how did we come across this--
how did we come up
with this notion
of building a data catalog?
Well, Google has been
dealing with a large number
of data assets for many years.
And Google felt the need to build something for itself a long time back.
And we are in our third
iteration of this internal data
catalog that handles more
than a million objects.
And it's used by more than
30,000 active Googlers.
So we have built a
very scalable system.
And now we want to take
that underlying technology
and make the data catalog
available to our GCP customers.
So that is the history.
Now talking about the data
catalog, in addition to
the technical
metadata-- and I'll
talk about what is-- what we
consider technical metadata
and what we consider business
metadata-- we enable users
to annotate business
metadata, and that,
too, in a collaborative
manner, to add overall value
and put data intelligence
in the data assets, right?
And as I mentioned
earlier, Data Catalog
provides the foundation
for data governance, right?
So Data Catalog was
in alpha for a while.
And we recently announced beta.
In the current incarnation--
this is the overview, right--
Data Catalog automatically
ingests metadata
from BigQuery, Pub/Sub, and GCS.
And it provides a simple search
interface for data discovery.
Your technical users as
well as your business users
can easily access it.
It supports a UI
as well as an API
so that you can do all kinds
of bulk annotation of data.
We have the innovation
here in business metadata.
Instead of having
simple tags, we
provide schematized tags to
capture really rich business
metadata.
We'll talk about
that in a moment.
And as I mentioned, we
auto-ingest the data.
Another strong feature, one that I'm very proud of, is that we have ACL controls on the business metadata that are also very strong. And for technical metadata, they're actually built in. They just leverage the existing ACLs.
So you, as a user, have to
do very little extra work.
The last bullet is
very attractive to many
of our customers because
there is a lot of data.
Our customers have thousands
of, let's say, BigQuery tables.
And each has, maybe,
hundreds of columns.
And it's very hard to
easily identify PII data.
So we offer an integration
with DLP, Data Loss Prevention
service, which
automatically scans the data
and creates appropriate
tags to identify
sensitive data like PII data.
So a little bit about the
Data Catalog architecture
internally--
you know, if I had to
identify two blocks,
it's like we have
a metadata store.
It's a transactional store.
And internally, we use Spanner.
We also have a search
index that we built in.
The search index leverages
the same technology
that powers Gmail and Google
Drive, which has ACL checks
built into it.
And it's a very scalable
performance system.
So we know that we can
build a system that
is really, really scalable.
Now talking about what it does for you as a user, it makes data discovery available at your fingertips with almost no additional effort.
Because the Data Catalog automatically syncs all the metadata from all these various sources and puts it in the index, you can do a simple keyword search.
And anybody can just
find appropriate matches.
If you're a power user, it also
enables you to do facet search,
just like you can do for
G Drive or Gmail, right?
So you can say, OK, I want
to look only for table names,
or I want to look only
for certain views,
or I want to look for assets
where the column matches
a particular keyword.
Or I have created tags
that say, hey, this is PII.
So show me everything
that matches PII.
And in the demo,
I will show more.
But a combination of
simple keyword search
and really powerful
facet search enables
you to do data discovery
very, very easily, all right?
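To make that combination concrete, here is a minimal sketch of how keyword terms and facet terms could filter a set of catalog entries. The entry records, field names, and the "facet:value" syntax are simplified stand-ins for illustration, not the actual Data Catalog search syntax.

```python
# Minimal sketch of keyword + facet search over catalog entries.
# Entries, fields, and query syntax are illustrative stand-ins.
ENTRIES = [
    {"name": "bookings_daily", "type": "TABLE", "columns": ["booking_id", "fare"], "tags": []},
    {"name": "customers", "type": "TABLE", "columns": ["email", "ssn"], "tags": ["pii"]},
    {"name": "bookings_view", "type": "VIEW", "columns": ["booking_id"], "tags": []},
]

def search(query):
    """Each space-separated term is either 'facet:value' or a bare keyword."""
    results = ENTRIES
    for term in query.split():
        if ":" in term:
            facet, value = term.split(":", 1)
            if facet == "type":
                # Restrict by asset type, e.g. only tables or only views.
                results = [e for e in results if e["type"] == value.upper()]
            elif facet == "column":
                # Assets where a column matches a particular keyword.
                results = [e for e in results if value in e["columns"]]
            elif facet == "tag":
                # Assets carrying a given business tag, e.g. PII.
                results = [e for e in results if value in e["tags"]]
        else:
            # Plain keyword match against the asset name.
            results = [e for e in results if term in e["name"]]
    return [e["name"] for e in results]
```

In this toy model, `search("type:table bookings")` narrows first to tables and then to names containing "bookings", mirroring the power-user flow described above.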
Now in addition to the
UI, as I mentioned,
if you have lots
and lots of data,
in order to annotate
them in a effective way,
we enable you to do bulk
updates of metadata through API.
And in beta, we are going to
be supporting Python, Java,
and Node.js language
libraries to make
it very easy for you to
do the API integration
in a programmatic way.
Now the API enables our customers to build enterprise applications that are very specific to their needs.
It also enables customers
to build their own front
end if they choose to do so, and
use the Data Catalog back end.
I would like to point out that
GOJEK is one of our customers
who have worked closely with
us and built such a front end.
And they will be on
stage, here with me,
shortly to do a demo of that.
So I talked briefly--
I kind of referred to technical
and business metadata.
And since it's kind of
central to my discussion,
I want to spend just
a few seconds talking
about our notion,
our terminology.
What is technical metadata, and
what is business metadata, OK?
So technical metadata is
things like table names, data
set name, column names, table
descriptions, date created,
date modified, et cetera,
the metadata that is already
there in the source system.
That, we refer to as
technical metadata.
And as I mentioned earlier, we automatically sync that metadata from the source system into Data Catalog and put it in the index.
So if you create a new table
in BigQuery, within, like,
three seconds, that
metadata shows up.
There is no need
for you, as a user,
to manually register
these data assets, right?
So that is automatic.
But in addition to that, there
is a lot of business context
that needs to be attached
to the data, things like,
well, does this
data asset have PII?
Or who is the data owner?
Does it have a delete
by requirement,
delete by certain date?
Or does it have retain till
certain date requirement?
What is the data
quality score, right?
What is the cardinality?
There is all kinds of
other relevant information
that needs to be stored.
And without a proper system
to store this metadata,
our customers tell
us that, well, they
leave it in some wiki
page, some Confluence page,
or some spreadsheet.
And it's not very searchable.
And it's not very usable.
Did I accidentally click that?
OK, sorry.
So-- I am having
some clicker issues.
All right.
Now here's an example, on the
right side, of what the business
metadata looks like, right?
So the business metadata
is schematized tags,
as I mentioned.
Unlike other data catalogs
where the business tags can
be simple text
strings, we allow you
to capture really rich
business metadata using schema.
We believe that, just like
when you have complex data,
you need schema to manage it.
Since you have
complex metadata, we
give you the ability to create
schema for your metadata.
We call them templates
so a template can
have multiple fields,
and you can have--
I don't know if you can
read this from here.
You can basically
attach information like,
is this data asset
approved for use?
What is the-- does it have PII?
What is the type of PII?
And there are currently five
data types that we support.
We support string.
We support double, Boolean,
enumerated, and datetime.
And using these
five types, you can
create really rich templates.
And you can use these templates
to create individual tags
that you can attach to
individual data assets.
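The template idea can be sketched like this: a template declares typed fields using the five supported types, and each tag is checked against it. The template name, field names, and validation helper here are illustrative assumptions, not the real Data Catalog API.

```python
from datetime import datetime

# Sketch of a schematized tag template using the five supported field
# types (string, double, boolean, enumerated, datetime). The template
# and its field names are made up for illustration.
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "double": lambda v: isinstance(v, float),
    "bool": lambda v: isinstance(v, bool),
    "datetime": lambda v: isinstance(v, datetime),
}

DATA_GOVERNANCE_TEMPLATE = {
    "approved_for_use": {"type": "bool"},
    "classification": {"type": "enum",
                       "values": ["public", "sensitive", "confidential", "regulatory"]},
    "data_quality_score": {"type": "double"},
    "owner": {"type": "string"},
}

def validate_tag(template, tag):
    """Check every field of a tag instance against the template's declared type."""
    for field, value in tag.items():
        spec = template.get(field)
        if spec is None:
            return False          # field not declared in the template
        if spec["type"] == "enum":
            if value not in spec["values"]:
                return False      # enumerated value outside the allowed set
        elif not TYPE_CHECKS[spec["type"]](value):
            return False          # wrong type for this field
    return True
```

The point of the schema is exactly this kind of check: a tag saying `classification: "internal"` would be rejected because "internal" is not in the enumerated set.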
The next slide is sort
of an illustration.
For example, let's assume
that you have this thought
example of a customer table.
And you have certain
columns in that table.
Wouldn't it be
great if you could
have some sort of a sticky
note, an electronic sticky note
if I may, some sort of
a schematized tag that
has somehow captured all
this rich information?
For example, the common
thing that analysts
do when they come to
work in the morning
is say, OK, I need
to generate a report.
I need to know if
the last night's ETL
job ran successfully.
I wish somebody could tell me, so I didn't have to call data engineering to ask, hey, what was the status of last night's ETL job?
How many rows were ingested?
Did we have any errors?
Did we have any warnings?
Now imagine there was a similar tag for data governance that told you--
what is the data classification?
Is it public?
Is it private?
Is it sensitive?
Where, in the
lifecycle, is this data?
Is it in prod?
Is it in test?
Is it deprecated?
To be deleted?
Where does it stand?
What is the data
quality tag, right?
What is the completeness
of this data?
What is the freshness
of this data, right?
Does it actually
meet the criteria?
Or has it fallen
off, and somehow I
should look at it with
suspicion, saying,
maybe this is not
the right data?
We allow you to create
such schematized tags
to capture all this rich
business metadata so that when
you come across a data asset, it
gives you the necessary context
to use it in a meaningful way.
We also enable you to create
tags at a column level.
So you can say, for these
columns, the PII is true,
and this is the type of PII.
This column might
have some funny name,
but it's actually SSN.
We are using Social
Security number.
This has email address.
There is another thing.
For example, oftentimes you come across a data asset, a column,
and say, OK, how am I
supposed to use this?
How was it calculated?
For example, in this case,
we say lifetime value,
lifetime value of your customer.
It's like, OK, what was the
formula used to calculate it?
How shall I interpret that?
All that rich business
metadata that you need in order
to really leverage
your data asset
now can be actually put
right next to the data.
And it becomes searchable.
So you can search
by saying, show me
everything that has PII access.
Now when I talk about discovery, one of the things that is really, really important is who has access to what, right?
In many traditional
data catalogs,
you have to actually
register your data set.
And then you have to set a separate set of permissions, a separate set of ACLs.
What we have done is we say,
OK, you don't need to do that.
First of all, we do
auto-ingestion of metadata.
And then we honor
the same ACLs that
govern data assets in
your source system.
So in this example,
we have two users.
User 1 has, say, read access
to all data sets in BQ.
And user 2 has read access to
only the first three data sets,
has metadata read access to the fourth data set, and no access to the fifth data set.
The view that user 2
has in Data Catalog
is that they can discover
A, B, C, and D. When
they click on the D data
asset though, they won't
be able to see the actual data.
When it comes to the
last data asset, E,
since they do not
have any access,
Data Catalog's search
actually does ACL checks.
And when you run the search, in that 500 milliseconds in which we try to return results, we probably run 100,000-plus ACL checks to decide what this user should see. What are they allowed to see?
And we show them the
right data assets.
So you do not need to
do any additional ACLs.
And there is no exfiltration of data-- or metadata-- of any kind.
So your data is
safe in that sense.
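The user 1 / user 2 example above could be sketched as three access levels per asset: data access, metadata-only access, or none. The asset names, access levels, and helper functions here are made up for illustration; the real system derives all of this from the source system's own ACLs.

```python
# Sketch of the ACL behavior described above: search honors the source
# system's ACLs, so a user discovers only assets where they have at
# least metadata access, and sees the actual data only with data access.
# Asset names and access levels are illustrative.
ACLS = {
    # asset: {user: access level}
    "A": {"user2": "data"},
    "B": {"user2": "data"},
    "C": {"user2": "data"},
    "D": {"user2": "metadata"},   # discoverable, but data is not readable
    "E": {},                      # user2 has no access at all: invisible
}

def discoverable(user):
    """Assets the user can find in search (metadata access or better)."""
    return sorted(a for a, acl in ACLS.items() if acl.get(user) in ("data", "metadata"))

def can_read_data(user, asset):
    """Whether the user can open the actual data behind an asset."""
    return ACLS.get(asset, {}).get(user) == "data"
```

So user 2 discovers A through D, gets an error opening D's data, and never sees E in results at all, matching the behavior in the demo.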
The ACL checks that
I talk about also
apply to the business
metadata tags.
So in this case,
in this example,
we have three sets of tags.
We have an ETL tag, a
data governance tag,
and a data quality tag.
Now the data governance team is able to create tags that only they can see.
Because oftentimes, they put
very sensitive information
in that metadata.
And maybe they do
not want everybody
to be able to see that.
So our metadata tags, the business tags, are also ACL-controlled.
Now many of our
customers come to us
and say, oh, this is great.
And they told us that, at the same time-- even though it's programmatic and has all these smarts-- our customers realized there's lots of data where the column names are not correct. A column might say column 17 or something, but actually, it has PII data.
What should we do?
So we have worked with DLP
to provide the integration
that I referred to earlier.
And using that, you can
use DLP to scan your data.
Either you can do full scan,
or you can do sample scan.
And the integration enables
you to leave the data as-is,
but attach--
the DLP will automatically
scan and attach
the metadata that will
identify your data to say, OK,
this is PII data.
And the type of PII is,
let's say, email address.
And this is my confidence level,
like 99% certain that this
is the kind of
data, which allows
you to process large
amounts of data
automatically and identify
PII information, which can
lead to better data governance.
Because then you
can say, OK, who
should have access to it, why
is this accessible to everybody,
and then do the
necessary controls.
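As a rough sketch of the idea behind that integration, here simple regexes stand in for DLP's detectors, and the returned (type, confidence) pair stands in for the tag that gets attached. None of this is the actual DLP API; it only illustrates the scan-and-tag flow.

```python
import re

# Rough sketch of DLP-style PII detection: scan sample values from a
# column and report the best-matching "detector" with a confidence.
# The regex detectors and the (type, confidence) result are simplified
# stand-ins for the real DLP service.
DETECTORS = {
    "EMAIL_ADDRESS": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "US_SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}

def scan_column(values):
    """Return (pii_type, confidence) for the best-matching detector, or None."""
    if not values:
        return None
    best = None
    for pii_type, pattern in DETECTORS.items():
        hits = sum(1 for v in values if pattern.match(v))
        confidence = hits / len(values)   # fraction of sampled values that match
        if confidence > 0 and (best is None or confidence > best[1]):
            best = (pii_type, confidence)
    return best
```

A column of email-looking strings comes back as EMAIL_ADDRESS with high confidence, which is exactly the kind of tag you could then search on ("show me everything that has PII").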
A brief thing about pricing--
we wanted to make Data Catalog
very accessible, eminently
accessible to all our users.
And at the same
time, we didn't want
to make it entirely free for
various reasons-- that it
will be abused, et cetera.
So our pricing is-- as I mentioned here, it's $100 per gigabyte per month of stored business metadata.
There is no charge for
technical metadata.
Because technical metadata
resides in the source systems.
And the way we look at it,
you have already paid for it.
We don't want to charge you again.
Also, we give you 1 megabyte
of business metadata free so
that you can try things out.
In terms of API pricing,
there is a second dimension.
And we have 1 million Data Catalog API calls included free for every user.
And we believe most users
would not need more than that.
But in the off chance that you
are developing some application
that calls a lot of APIs, it's like $10 per 100,000 API calls per month.
So it's a very
attractive pricing.
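Under the numbers stated above, a monthly bill could be estimated like this. The function is a sketch of the quoted prices (1 MB of business metadata and 1 million API calls free), not an official pricing calculator.

```python
# Sketch of the pricing described above: $100 per GB-month of stored
# business metadata (first 1 MB free, technical metadata free) and
# $10 per 100,000 API calls beyond the 1 million free calls.
def monthly_cost(business_metadata_mb, api_calls):
    billable_mb = max(0, business_metadata_mb - 1)    # 1 MB free tier
    storage = billable_mb / 1024 * 100                # $100 per GB-month
    billable_calls = max(0, api_calls - 1_000_000)    # 1M free calls
    api = billable_calls / 100_000 * 10               # $10 per 100k calls
    return round(storage + api, 2)
```

For example, roughly 1 GB of business metadata plus 1.2 million API calls would come out to about $120 a month under these assumptions.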
We are working
closely with a number
of partners,
Collibra, Informatica,
Tableau, and Looker.
And they will be using our API.
So I mentioned the
APIs are open and free.
We believe that
it's your metadata,
you can do whatever
you want to do with it.
And using those APIs, they
will be providing integration
in future so that your
on-prem metadata can also
be ingested into Data
Catalog so that you
will have one catalog to
find all your data assets.
And you won't have to
go to multiple places.
So at this point, I'd like to
switch over to a brief demo.
So I wanted to show you what
the Data Catalog looks like.
It's early.
We are close to beta.
We are not there yet.
In the next few
weeks, hopefully we
will update some
of these things.
But I want to show you what Data Catalog looks like.
This will show up
in your console.
And any of you, if you want to
start accessing Data Catalog
today, I can whitelist you.
And you can start using
the alpha that's available.
So the Data Catalog
UI looks as follows.
It's organized along
the lines of cards.
We have different cards.
There is a card for
exploring data assets.
And each of these
links are clickable.
And you can just click them
to look for data assets.
And you can go back here.
You can create
your tag templates.
You can explore
your tag templates.
There are search tips.
You can search using
tags, asset types, column,
bucket names, as I mentioned.
It's a rich facet search.
And there is a link to
an online documentation.
On the left side, we
show the popular tables.
In its current incarnation,
it says, OK, we'll
look at the usage,
say, in BigQuery.
And which are the
really popular tables
we want to show you
so that you don't even
have to search for this?
We plan to develop
it in this manner,
that, in future, it
will be more customized.
My goal is that 50% of the time, you shouldn't even have to go to the search box.
You should say, oh, how
did this read my mind?
This is exactly what
I was looking for.
The other 50% of
the time, maybe you
go search for the data
assets that you want.
Briefly, I want to
show you, for example,
if I look for all
tables and views,
and I-- so did that
too quickly, I guess.
So we have all these
search results here
that match the table.
I haven't even
looked for anything
in particular in
terms of keyword.
But I found this table.
And this is in Data Catalog.
It shows me the schema.
It shows me the details.
And there is an integration
built in with BigQuery.
So from Data Catalog, if you
like that particular table,
you click on it, it
launches BigQuery.
And you can start querying.
You can look at the preview.
And you can-- this is the
integration we provide.
But in addition-- in addition,
if I say bookings, let's say,
I found a table.
And in this table, I have all
these tags, the kind of tags
that I talked about.
We have the data discovery
tag, data governance tag,
quality tag, template tag.
So let's look at the
data governance tag.
This sort of illustrates
how I can actually
categorize the data.
And I say--
I have the data classification
as an enumerated type.
And I can choose
the category to say,
is it public, sensitive,
confidential, regulatory.
I can specify the
data lifecycle.
Is it in prep?
Is it in test?
Is it in prod, et cetera.
And I can specify all kinds
of rich metadata tags,
as I talked about, right?
The other thing that I
wanted to demonstrate
is how we do ACL controls.
So in this case, I
have two windows.
And here, I'm logged
in as an admin
who has access to everything.
And here, I'm logged in
as an analyst, who has
access to only limited things.
And in this case, when
I look for bookings,
I should find the
same set of tables.
I should find the same
set of tables, right?
On the other hand, if
I go to, say, finance,
I find these two tables here,
and these two tables here.
Now in this case, when
I click on this table
and try to access
the data, I should
be able to access the data,
because I do have permission.
I can see this data.
I can preview this data.
In the other case,
I'm an analyst.
I can discover this data.
I can discover the metadata.
But if I actually try
to access the data,
I should get an error.
I see this error message.
And I cannot preview it.
Now that shows that
the ACLs are working.
If, similarly, I go
to the data asset
that I do not have access to--
as an admin, I have
access to HR data.
I see all this.
As an analyst, I go to HR data.
Right, I don't have
access to HR data.
That just quickly
shows the two things
that I wanted to highlight.
One is the schematized metadata,
the tags that we create.
And the second
thing is the ACLs.
At this point, I would
like to invite my buddies
from GOJEK, Renata
and Dimas, who have,
as I mentioned-- just to get
the context-- who have developed
the GOJEK front end.
Because they had a front
end that they were using.
And now they have kept
the same front end,
but they're using the
Data Catalog back end.
Without further ado--
DIMAS NATAJIWA: OK,
thank you, Shekhar.
And thank you, Google,
for having us here.
And thank you guys for
joining our session here.
We are from GOJEK.
My name is Dimas.
I am the Data Warehouse
Manager in GOJEK.
Before we go deeper into how we
are implementing Data Catalog
API in our data
discovery tools, we
would like to introduce
GOJEK to you first.
So GOJEK is actually a technology company that has a purpose to improve people's quality of life, OK?
And so I believe not many of you know that, actually, GOJEK started as a call center in 2010 for ojek services-- ojek, actually, is a term in Indonesia for motorcycle ride hailing, or a motorcycle taxi. Five years after that, we launched our mobile app with three initial services, which are GO-RIDE, GO-SEND, and GO-SHOP.
And starting in 2016, we wanted to become a mobile app for daily needs. We started to expand into more than 200 provinces in Indonesia and launched more than 20 services. And just last year, 2018, we made history by doing an international expansion to Vietnam, Thailand, and Singapore.
In the beginning, GOJEK was built to provide a solution for our informal sector. Because here in Indonesia, we have so many challenges for the informal sector. Some of the challenges that tools like GOJEK solve are, first, we would like to solve the inefficiency of time and accessibility, both for the customer and the service provider. Because in the early days, if you would like to have ojek services, you need to go to the main street. You need to find an available ojek there. And on the other hand, most of the time of the service provider itself is wasted waiting for the customer rather than providing the service. And the last one, GOJEK would also like to empower our service providers with financial services so they are able to plan their future instead of only earning money for one day.
RENATA KUMALA: So hi, everyone.
My name is Renata.
And I'm currently
the product manager
in the business intelligence
core team at GOJEK.
So just like what Dimas said, we-- at GOJEK, we are here to solve real problems.
And when we started
off in 2010, we
started off with just one
service, which is just a ride
hailing business.
However, as time--
as you can imagine,
in Indonesia, we
are a developing
as well as a growing nation.
So there are a lot
of problems that
are there to be solved,
a lot of opportunity
that we can help on.
And that's the reason
why we grow to,
now, more than 20
services in 2018.
So when you hear more
than 20 services,
like, what can you
provide to your user?
There seems to be
too many services.
So we provide the user
ranging from payment services,
to financial services, as
well as lifestyle services
like GO-GLAM.
So GO-GLAM is basically,
you can get connected
to a professional hairstylist.
And that hairstylist
can come to your home
and get you a fresh new look.
Or we have GO-DAILY, which is
basically a daily subscription
service.
So you can get your daily household needs delivered, like tissues, rice, and mineral water, on a scheduled basis.
Either you want it on a weekly
basis or a monthly basis.
So we're not only
growing in terms
of the number of services that
we are providing to our user,
but also regionally.
So we are now live
not only, obviously,
in Indonesia, but also
in Singapore, Thailand,
and Vietnam.
And users have been interested-- you can see the numbers, with 130 million app downloads. We are partnering with more than 2 million drivers and more than 400,000 merchants.
And we are now
transacting more than 100
million monthly transactions.
So as our business
grow, so does our data.
As you can imagine, when you add one more service, or you expand to one more region, it doesn't go linearly. It just doesn't go like, when you add a service, you add one database. It doesn't go like that. So the data has become much more complex.
Even though we are using BigQuery as our data warehouse, which already simplifies a lot of these complexities, data is not just the raw data that you get from your application, right? It also includes the dashboards that you have in a lot of data visualization tools like Metabase and Tableau, and also analyses that a lot of your data teams have created.
Like, we have the biz
intelligence team.
We have data science.
We have research team.
And on top of that, as
well, you have your people
inside your company, which also
provide a lot of those data.
So on top of this complexity and the variety of data that we have, we are still experiencing a growing volume of data. On a month-to-month basis, we are growing 32% in the volume of data that we are getting from our application. And also, we have more than 700 data sources right now that we need to aggregate as a data team, and tens of thousands of dashboards. And all of this is combined with the complexity, combined with very demanding-- but we love them-- users, more than 2,000 users on a weekly basis. So everything combined leads to some major pain points that we face on a daily basis.
And it's not just us, the data team, but also our users, right? So first of all, mainly, it creates two main pain points, which are productivity and governance. But in a real sense, it's actually the time that you need to spend when you want to answer a particular question, right?
You can imagine, like
outside a company, when
you have a question, you
can simply Google search it,
and get the reference, read all
through, and provide an answer.
And you'll want to do the
same when you're in a company,
right, like us.
So it just takes too much time. For a person to answer a particular question, they would need more than one hour on average.
Sometimes, they would
need more than one
day, because we just
realized that, oops, we
don't have that data.
So we might need to
go to the product team
and actually produce those data.
And secondly, it's also
taking a lot of time
for us, product team.
Like, PMs, engineer,
a lot of them
actually end up going to us.
Because they want to ask,
hey, I know the data is there.
But I'm not quite sure whether
the data quality is on track?
So can you run some query
and check it for us?
Or, so I know that it is there.
I checked it by myself.
The data quality is pretty good.
But I'm not sure whether it's
already updated for today.
So a lot of these things also add up on our plate as well. And it takes a lot of our time, which we are not supposed to be spending that way.
We're supposed to be
building products.
And the last part is,
because people find it hard,
and it takes too much time
to actually find data,
they recreate a lot of the
data object that we already
provide to them, and recreating
a lot of dashboard as well.
So that's the reason why we created our own internal tool, a data discovery tool, which we call DataDex. Basically, we want to build it like a Google search for all of the data that we have at GOJEK.
So it compiles a lot of
the main data products
that people are using
inside our company, which
is BigQuery, the dashboard from
Metabase, and, as well, a lot
of analyses that are
made from the data teams.
And what's powerful with this is that it actually reduces a lot of the time that was previously wasted, so you get to the answer quickly.
So now I'm going to
pass it down to Dimas.
So Dimas here will talk
about the back end of DataDex
as well as the limitation,
and why we chose Data Catalog.
Sorry.
DIMAS NATAJIWA: OK,
thank you, Renata.
I believe all of you have already heard how great DataDex is. However, this is the back end of it. So we are using BigQuery as our main data source.
We have all of the metadata there. And then what happens is we have one job that extracts all of our metadata and groups it into its types, like which one is the data set, which one is the table, which one is the column label. After that, we store all of the metadata in a SQL store, transform it again, and push it to Elasticsearch. Elasticsearch becomes the back end for DataDex, so DataDex doesn't need to go to the SQL store. It just needs to go to Elasticsearch.
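The batch pipeline described here could be sketched roughly as follows: a daily job pulls warehouse metadata, groups it by type, and flattens it into documents for the search index. All names and record shapes are illustrative, not the actual DataDex pipeline.

```python
# Rough sketch of the batch pipeline described above: one daily job
# extracts warehouse metadata, groups it by type (dataset / table /
# column), and emits flat documents for the search index. All names
# and record shapes are illustrative.
RAW_METADATA = [
    {"kind": "dataset", "name": "bookings"},
    {"kind": "table", "name": "bookings.daily", "dataset": "bookings"},
    {"kind": "column", "name": "bookings.daily.fare", "table": "bookings.daily"},
]

def group_by_kind(records):
    """Group extracted metadata records by their type."""
    grouped = {}
    for rec in records:
        grouped.setdefault(rec["kind"], []).append(rec)
    return grouped

def to_search_docs(grouped):
    """Flatten grouped metadata into index documents keyed by name."""
    return [
        {"id": rec["name"], "kind": kind}
        for kind, records in grouped.items()
        for rec in records
    ]
```

Because this whole pipeline only runs once a day, anything created after the last run is invisible until the next run, which is exactly the staleness problem described next.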
By having this architecture, actually, we solved that problem. But we created another one. So what happened is, now, our maintenance is more and more complex. Because once we were serving BigQuery metadata, a lot of requests came in, like, can you please provide metadata for Google Cloud Storage? Can you please provide metadata for Tableau dashboards? All of them were coming in as requests. And the second one is, our search results might be inaccurate. Why? Because all of these jobs run daily. And the third one is the engineering effort to maintain all of the metadata, since the number of our BigQuery tables and columns is increasing constantly. Currently, we have more than 36,000 tables and more than 700 columns in our BigQuery.
So when we were facing this difficulty, the Data Catalog team came to us. And they have Data Catalog as a solution. We started our collaboration on July 2, 2018, as an early access user for this program. It's great to have Shekhar and the team. They are really supporting us for our needs in solving that [INAUDIBLE]. Just as an example, they provided the Node.js client API even before it was released in the alpha.
And besides that, we are also saving a number of development efforts. Because we know there are so many features that will be available in Data Catalog, and we do not need to develop them. So from that previous architecture, we are going to this simple architecture. It's only DataDex connecting to the Data Catalog API.
What is the advantages by
having this architecture?
First, there is
really no operation
required to maintain all of the
infrastructure and availability
of the metadata.
Why?
Because Data Catalog are able
to provide a real-time metadata.
Once we are creating
a BigQuery table,
then it will be automatically
showing Data Catalog API.
And finally, we believe
that with Data Catalog,
there will be more and
more features coming in.
And it will let us focus
more on creating business
and technical metadata
instead of developing the data
discovery tools ourselves.
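The simplified flow described here, a front end calling the Data Catalog API directly, can be sketched with the Python client library. The project ID and query below are illustrative, and the `search_catalog` call needs Google Cloud credentials when actually run; a minimal sketch, not DataDex's actual code.

```python
# Sketch of a DataDex-style search against the Data Catalog API.
# The project ID and query are illustrative; a real call needs the
# google-cloud-datacatalog library and application-default credentials.

def build_search_request(project_id: str, query: str) -> dict:
    """Build a catalog search request scoped to one project."""
    return {
        "scope": {"include_project_ids": [project_id]},
        "query": query,
    }

def search_catalog(project_id: str, query: str):
    # Imported here so the sketch can be read without the library installed.
    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()
    request = build_search_request(project_id, query)
    # Returns an iterable of SearchCatalogResult objects, one per entry.
    return list(client.search_catalog(request=request))

if __name__ == "__main__":
    print(build_search_request("bi-gojek", "merchant commission go-food booking"))
```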
OK, I think we can go
through the demo now.
Let's go to demo two.
RENATA KUMALA: Cool,
OK, let's start.
So for the first case--
OK, so before we start,
this is dummy data,
even though the schema
is exactly what we
have internally.
But all those values
are dummy data.
So let's start.
So the first use case is,
for example, I'm an analyst.
And I want to find
the commission
for a particular merchant
for their GO-FOOD bookings.
So let's search for that.
DIMAS NATAJIWA: OK, so you would
like to do analysis on that.
Let's search for merchant
commission GO-FOOD
booking, right?
RENATA KUMALA: Yeah.
DIMAS NATAJIWA: Good?
RENATA KUMALA: Let's go.
DIMAS NATAJIWA: OK.
Here is the result.
RENATA KUMALA: Hmm,
OK, let's see--
OK, from all of
these options, I want
to see a summary table
that is updated daily.
So that means the abbreviation
would be S for summary
and D for daily.
Let's choose the first one.
DIMAS NATAJIWA: Cool, so
it is the first result.
This is the table's details.
RENATA KUMALA: OK, cool.
DIMAS NATAJIWA: What
do you want to see?
RENATA KUMALA: So I want
to see where it lives.
It's in project bi-gojek.
It's in the dataset access,
and the type is TABLE--
cool.
Let's go to the schema and see.
I'm particularly interested
in a metric named
merchant commission.
DIMAS NATAJIWA: Sure,
it should be there.
Here it is.
RENATA KUMALA: Well,
OK, so that's all of it.
But I want to check
whether the data
quality is on track, whether
it's actually trustworthy.
Can we go to the tags and see?
DIMAS NATAJIWA:
Sure, here it is.
We have two kinds of tags,
the technical and the data
governance metadata.
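A tag like the ones shown here can be sketched as the payload a Data Catalog `create_tag` call would take. The template path and field names (`data_quality_score`, `row_differences`) are assumptions for illustration, not GOJEK's actual schema.

```python
# Sketch of a data-governance tag payload for a table entry.
# The template path and field names are illustrative assumptions,
# not GOJEK's actual schema.

def make_quality_tag(template: str, score: float, row_diff: int) -> dict:
    """Build a Tag payload of the shape DataCatalogClient.create_tag expects."""
    return {
        "template": template,
        "fields": {
            "data_quality_score": {"double_value": score},
            "row_differences": {"double_value": float(row_diff)},
        },
    }

tag = make_quality_tag(
    "projects/bi-gojek/locations/us/tagTemplates/data_governance",  # assumed name
    100.0,  # the "perfect" score from the demo
    0,
)
print(tag["fields"]["data_quality_score"]["double_value"])  # 100.0
```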
RENATA KUMALA: OK, so
the data quality score--
100%.
So it's perfect.
Let's just go to the
source to go to the link
and use that data.
DIMAS NATAJIWA: Sure.
RENATA KUMALA: Cool.
DIMAS NATAJIWA: Here it is.
RENATA KUMALA: So that's
the first use case.
As an analyst, you just
search for a specific keyword.
And it redirects you
directly to the data source.
It goes to the
table in BigQuery.
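That redirect step can be sketched as a small helper: a Data Catalog search result for a BigQuery table carries a `linked_resource` in a documented format, which can be turned into the familiar `project.dataset.table` form. The helper itself is ours, not part of the API; the project, dataset, and table names below are dummy values like the demo data.

```python
import re

# A Data Catalog search result for a BigQuery table carries a
# linked_resource of the form
#   //bigquery.googleapis.com/projects/<p>/datasets/<d>/tables/<t>
# This helper (our own, not part of the API) turns that into the
# project.dataset.table form an analyst can use in BigQuery.

_BQ_RESOURCE = re.compile(
    r"^//bigquery\.googleapis\.com/projects/(?P<project>[^/]+)"
    r"/datasets/(?P<dataset>[^/]+)/tables/(?P<table>[^/]+)$"
)

def table_id_from_linked_resource(linked_resource: str) -> str:
    match = _BQ_RESOURCE.match(linked_resource)
    if match is None:
        raise ValueError(f"not a BigQuery table resource: {linked_resource}")
    return "{project}.{dataset}.{table}".format(**match.groupdict())

print(table_id_from_linked_resource(
    "//bigquery.googleapis.com/projects/bi-gojek/datasets/access/tables/demo"
))
# bi-gojek.access.demo
```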
So the next use
case is as data
governance personnel.
So for example, the
easiest way-- like,
the most basic way to figure out
you have a data quality issue--
is from the row differences
between one day and another.
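The day-over-day check described here can be sketched as a simple threshold comparison. The table names and row counts below are dummy values, like the demo data.

```python
# The basic day-over-day check: flag tables whose row count changed
# by more than a threshold between two loads. Names and counts are
# dummy values.

def row_diff_alerts(yesterday: dict, today: dict, threshold: int = 1_000_000) -> dict:
    """Return {table: absolute row difference} for tables over the threshold."""
    alerts = {}
    for table, count in today.items():
        diff = abs(count - yesterday.get(table, 0))
        if diff > threshold:
            alerts[table] = diff
    return alerts

counts_yesterday = {"fd_gopoints_voucher_transaction": 10_000_000, "s_d_booking": 500}
counts_today = {"fd_gopoints_voucher_transaction": 14_200_000, "s_d_booking": 510}
print(row_diff_alerts(counts_yesterday, counts_today))
# {'fd_gopoints_voucher_transaction': 4200000}
```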
So let's take an extreme case.
Let's search for
row differences
of more than 1 million.
DIMAS NATAJIWA: Sure.
Let me type this first--
row differences.
The difference is 1 million?
RENATA KUMALA: Yeah--
DIMAS NATAJIWA: OK.
RENATA KUMALA: --1
million or more.
DIMAS NATAJIWA: Yeah.
Just for now, we
are still using EAG
because we are in early phases.
But I believe it will
change in the future.
Go for it?
RENATA KUMALA: Yes, let's check.
Hopefully there's-- ugh.
DIMAS NATAJIWA: Oh, OK.
RENATA KUMALA: OK,
so apparently we
have found an issue in
this particular table,
fd_gopoints_voucher_transaction.
So let's click on it and see.
DIMAS NATAJIWA:
Oops, here it is.
RENATA KUMALA: OK, so there
are more than 4 million row
differences.
Hmm-- I would like to
know what's the source
of this table.
So maybe let's search for,
like, voucher transaction,
right?
DIMAS NATAJIWA: Yeah,
sure, because it is lying
in the staging dataset, which
is within the data lake layer.
So we can go directly to the
Google Cloud Storage source,
because it is the
source of the table.
Let's search for the
voucher transaction.
And go to the Google
Cloud Storage entry,
because we have to see the file.
Yeah, there it is.
We found the source
of this table.
RENATA KUMALA: Cool.
So as you can see
here, you'll be
able to locate where
the bucket lies,
and also the file
path as well.
And let's see the files
within it as well, OK?
DIMAS NATAJIWA: Sure.
So by using the Data
Catalog API, actually,
it automatically transforms
our Google Cloud Storage
file names into a pattern.
So we know which one of
the file names we need
to change at our [INAUDIBLE].
RENATA KUMALA: So by using
this, you can just
copy and paste it, go
directly to the GCS file,
create a table in
BigQuery, and be
able to analyze which part
went wrong, and fix it.
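The file-name-to-pattern step can be approximated like this; it is our own sketch of the idea, not the actual Data Catalog implementation. It replaces date-like parts of a Cloud Storage object name with a wildcard, giving a URI of the kind BigQuery accepts as an external table source. The bucket and file names are dummy values.

```python
import re

# Approximation (ours, not Data Catalog's actual logic) of turning
# individual Cloud Storage file names into one wildcard pattern:
# replace YYYY-MM-DD or YYYYMMDD runs with *.

def to_wildcard_pattern(gcs_uri: str) -> str:
    """Collapse date-stamped object names into a single wildcard URI."""
    return re.sub(r"\d{4}-?\d{2}-?\d{2}", "*", gcs_uri)

uri = "gs://dummy-bucket/voucher_transaction/data_2019-04-10.csv"
print(to_wildcard_pattern(uri))
# gs://dummy-bucket/voucher_transaction/data_*.csv
```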
DIMAS NATAJIWA: Yep,
that's all for our demo.
RENATA KUMALA: That's cool.
So from our session, basically
what we're trying to say
is that data discovery was
a pain point for us.
We built it ourselves
and faced a lot of problems
maintaining it ourselves.
And it just makes
much more sense for us
to actually use Data Catalog.
And the flexibility
that it actually
gives us-- using the API
and having our own front-end
interface, which can
cover a lot of our custom
needs-- really
does work and help.
[MUSIC PLAYING]
