ERIC ANDERSON: So
glad to have you here.
As you know, we announced
Data Prep this morning,
and so I had to obscurely
title this talk.
I mean maybe it is
a fitting title,
but it couldn't be as
descriptive as I wanted.
So maybe you are the
few that actually knew
what you were coming to today.
Maybe you read it and
understood the code.
If so, congratulations.
Yes.
So, I'm Eric Anderson, and I'm the product manager for Cloud Data Prep.
So we announced it this morning
for the first time publicly,
which I'm super excited about.
I can finally tell my
wife what I do at work.
No, it's not that bad.
And she is here tonight, so
you can give her a hard time.
So Cloud Data Prep, as you know or as you may have heard, is an intelligent, visual, and serverless data preparation interface that runs on Google Cloud's data services.
A quick overview on what
we're going to cover today,
and maybe to recap
the announcement.
It is available today
in private beta,
which means to select users.
And then we plan, in the coming weeks--I'm sorry, weeks or months; I don't want to promise anything or set expectations--for this to be available publicly, in a public beta.
We're going to start with the
need for data preparation.
Then we'll go into the two benefit pillars for Data Prep. So first, the visual and interactive interface that we've described previously and that you saw this morning, as well as the serverless aspects--how these jobs are executed without any configuration on your part, at basically any scale.
And then we'll wrap up
with a couple examples,
and spend most of
our time, which
is where I think you'd like
it, in a demonstration.
So the need for
data preparation.
And I'm sorry to pull out
the same numbers you've
seen before, but it's
part of the routine.
You have to do this.
As you may know, or may feel, 76% of data professionals say that this is the least enjoyable part of their work. It's like flossing, or maybe doing the dishes, it sounds like. And 80% of a data professional's time is spent in data preparation.
So if we were to structure
this conference like you
do your data time, I'd be
talking for like three days
straight and the ML team
would have like 30 minutes.
And nobody wants to hear from
me for three days straight.
So let's figure
out what's going on
and how we can fix
data preparation.
First, the cast and
crew of those involved.
There tends to be
kind of a spectrum
or at least camps of people.
There's the business
folks who are
close to the problem,
the need for the data,
and then there are those with
the data skills, the data
scientists or engineers
as we might call them.
It turns out that
all of these people
have data preparation
needs and problems.
And Cloud Data Prep,
as you see today,
we think solves problems
for all these people.
Although they do have their common tools today--some of them available on Google Cloud Platform or within G Suite. So I don't mean to discount those tools; these are great tools. And if you're comfortable with them, stick with them.
In particular, if we're talking about data preparation, Cloud Datalab is a wonderful place for data scientists: easy to set up, and a familiar IPython-like environment. But today we're going to be talking about Data Prep.
And to get you geared up on, again, why we need data preparation, I just want to start with this kind of analogy: that we're going from raw to refined data.
People often ask, so
what is this raw data
that needs all this cleaning?
Well, you've seen it before.
This is logs, billing
data, transactions.
And the reason why we're doing the cleaning is because as soon as you try and visualize this or run a report, you realize that, oh, there are these null values, and that you'd prefer these things in different columns so that you could sort by them. And very quickly the need for data preparation arises, and then you spend 80% of your time doing it.
So let's use a
motivating example
throughout this that
we can refer to for why
we're doing this data prep.
Imagine you're one of these line-of-business users I referred to. You're more familiar with Sheets, or maybe a SQL query every now and then, and you'd like to measure the impact and value of your customer support portal--specifically web chat, we'll say, hypothetically.
The first thing you
might have to do
is figure out where
all this data lives.
And then you're going to
have to do some cleaning,
as we mentioned.
And you're going
to realize through
that process you need to
structure it a little bit
differently, enrich it.
These are the steps
that I'm referring to
as data preparation.
And, as mentioned,
you probably don't
have the skills to do this.
So if you were to try and
do this on your own today,
you might first enlist
the help of an expert.
We'll call them a data scientist, but it's somebody who can find and understand the data sources you're looking to use. So this data scientist is going to go through basically that same process we described, but they're going to do it in a way that you might not understand. They're going to write these ninja scripts, as they're called, in Python, or they're going to use R--and other people in the room are going to think, oh, R, that's such a bad idea. And they're going to use SQL. And they're probably going to do this on a sample of the data, on their local machine, and then they're going to return this to you.
And you're going to say, great,
so this has all the answers.
And they're like, actually, this is just a sample. And if you really want to run this every day, we're going to have to get a data engineer involved.
So you're already like a week
or two into this project,
and you get the data
engineer involved.
And they do more or less the same thing, but this time they're going to do it in Java, and they're going to add unit tests. And they're going to run it on Hadoop. And they have this whole environment set up so you don't have to worry about it.
But now you're another two or
three weeks into the process.
And people are asking
you for funding.
So you're getting executive
stakeholders involved,
and maybe you don't want
to go through this in order
to measure the impact.
I might be a little
bit over-dramatic,
but I might not actually.
So this is where we think Data Prep can help. It helps you do this at your own workstation, on your own time, in the moment you need to do it. So let's get into that first benefit pillar, the visual, interactive interface.
But before that, I want you
to be able to contrast this
with how we do it today.
So I've done some of this data science-y type work, and I have basically a little cocktail of queries and scripts that I like to throw at a new data set. Because I have to understand whether this thing is exactly what it's called, how fresh it is--basically I'm treating the data like a black box. I don't actually look at the data, I just kind of poke and prod it to understand it. That's one approach. Whereas Data Prep just throws you right into your data.
Rather than feeling through
a dark room for furniture,
we throw the flood lights on.
And you see your data and
you can interact with it
immediately.
We'll take a quick tour
of some of the elements
in the user interface so
you get a feel for them.
The first thing you're probably noticing is that you're seeing distributions along the top of your data. Each column is profiled to understand its distribution, presented in histograms and in ranges. In the case of geographic data, or other data that's easily described in different ways, we can throw it on a map. In the case of a time series, we would display it as a time series.
You're also seeing, at the top, a type for each column. And you're seeing as well a mostly green bar, maybe with a little bit of black and red in there too. This is telling you how well that column actually fits that type. So Data Prep has detected the type, and if the bar is all green, then in fact all the values are valid for that type. If you're seeing some black, there are missing values; if red, there are values that are actually invalid according to that type.
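To make that bar concrete, here's a toy Python sketch of the idea--tallying each cell as valid, missing, or invalid against a detected type. This is illustrative only, not Data Prep's actual code:

```python
import re

# Toy sketch of the green/black/red bar: tally each cell against the
# column's detected type -- here, a simple YYYY-MM-DD date.
values = ['2017-03-09', '', 'not-a-date', '2017-03-10']
date_pattern = re.compile(r'^\d{4}-\d{2}-\d{2}$')

tally = {'valid': 0, 'missing': 0, 'invalid': 0}
for v in values:
    if v == '':
        tally['missing'] += 1      # black: missing value
    elif date_pattern.match(v):
        tally['valid'] += 1        # green: fits the type
    else:
        tally['invalid'] += 1      # red: doesn't fit the type
print(tally)  # {'valid': 2, 'missing': 1, 'invalid': 1}
```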
You can adjust the
type as necessary.
You can also create custom types
if the 17 available immediately
in Data Prep don't suit you.
And there's also a series
of date and time types.
So we kind of got types covered.
This was the other kind of motivating example. Traditional ETL tools put you in kind of a logic-centric view. So if you want to apply transformations to the data, you typically see a block of where your data came from, a transformation, maybe a join. But at this point, towards the end of my pipeline, I don't quite know what my data looks like, and it's hard to have a mental model of what I need to apply next.
In the case of Data Prep, we just showed you how you have a mental model of your data when you're visualizing the data grid. And there it's easy to decide what kind of transformation you want to apply next. You can use the cues that we've provided for you in the distributions: you might detect an outlier, or want to transform the missing or invalid values.
Or if you see some
other kind of anomaly,
all you do is just
highlight or click the data.
And what Data Prep
begins to do is
suggest the transformation
it thinks you're
invoking by this highlight.
It may take a highlight or two in order to arrive at precisely the transformation you want. These suggestions are shown at the bottom. And you'll see that they have kind of sub-suggestions within a suggestion, so you can alter the suggestion as necessary, and then apply it to the data set.
So you can apply one transformation that way, but generally you're planning lots of transformations. A clean here, a clean there. Some structuring, enriching. These transformations are added up into what we call a recipe--a little more approachable than the word script, perhaps. And recipes in Data Prep, maybe slightly differently than SQL or Python, are human-readable, or at least human-readable to most humans.
And they are hidden by default.
So if you're the
line of business user
who just wants to
make a few changes
and doesn't want
to learn the world,
it's very easy and approachable.
But as you continue
to use Data Prep,
and if you become a power user,
it's also very expressive.
You don't need to feel
like you're held back
by some of the
ease of Data Prep,
because these recipes still behave like scripts. They export; you can apply advanced functions within them; and the recipe itself can be reordered, inserted into, undone--rewind, play forward, roll back, it all means the same thing; I'm just adding and adding to it.
So you've explored
your data, you've
applied some
transformations, the data
is looking a lot cleaner.
But really, typically with these projects, you're trying to put the data in a new place, in a new form. There's still shaping and enriching to do. And joins tend to be the hardest part, at least for the SQL newbies--I'll put myself in that bucket. I still get a little bit anxious when I have to do a join and figure out the keys.
So in Data Prep you'll be
looking at a single data set.
You'll choose to
add a new data set.
You'll see a preview.
So all along the way you kind of know you're on the right track: I've got my original data set, I've got my preview. And then when you select a join, we detect common keys, as they're called--common fields that are likely to be joinable. And you can choose the keys that you want to include of these, and then go ahead and start choosing all the columns. Really, joining isn't so much a scary process as just selecting what you want from the two tables and shipping it off as a new step in your recipe of transformations.
You'll notice that basically everything is a step in your recipe of transformations: renaming things, joining, enriching. Which makes these recipes portable, because they contain all the necessary steps, formatting included, to shape a data set.
So we've done that kind of whirlwind tour of the user interface of Data Prep, to introduce you to how you interact with things in a visual, quick way. And I mention quick because with most of what we do here at Google Cloud, we pride ourselves on it being very scalable. So how do you quickly navigate and change columns on up to petabytes of data?
Well, what we initially do in Data Prep is take a sample. You're always operating in the browser on a sample of your data. And we give you some control over that sampling process, to make sure it's representative.
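To make "representative sample" concrete, a classic way to draw one in a single pass is reservoir sampling. Here's a toy Python sketch of that idea--Data Prep's actual sampler is its own, so treat this as purely illustrative:

```python
import random

# Reservoir sampling: one pass, fixed memory, every row equally
# likely to end up in the sample.
def reservoir_sample(rows, k):
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)
        else:
            j = random.randrange(i + 1)
            if j < k:
                sample[j] = row
    return sample

with open('webchat.log') as f:           # hypothetical file
    sample = reservoir_sample(f, 10000)  # keep 10,000 rows
```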
And so this next bit
is about how we then
take the transformations that
you've applied on your sample
and scale them out
to run on any size.
Now, here I draw a contrast--not to traditional ETL tools or to data science approaches to data preparation, but to other standard approaches to data preparation. There are other vendors and tools out there that provide data preparation tooling. And in those cases, what we might call the first generation, they empower business users to apply transformations themselves. But it still gets hard when you want to scale this out. You still need an IT and a data ops team to manage a cluster of some kind.
You now have multiple vendors providing you different parts of the solution, so you're negotiating org-wide licenses, or you have to offset costs. And then, finally, there are app permissions, data permissions, software permissions that need to be coordinated.
Part of the benefit of having
this integrated in the Google
Cloud platform is that
the execution environment
is dead simple.
And it's baked right into
all the permissioning models
you're used to.
You simply hit Run Job, and we spin up a Dataflow job and execute this across the entire data set. So let's go into a little bit more detail about how Dataflow does this for you.
Behind the scenes
these are the steps
that occur when a
user clicks Run Job.
Data Prep creates an Apache Beam pipeline in Java and submits that pipeline to Cloud Dataflow. So this is a translation from your human-readable script, or recipe, to a Java pipeline that you may or may not be very familiar with.
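You may never look at that generated Java, but to give the pipeline a shape, here's a minimal Beam pipeline sketched with the Beam Python SDK--the product itself emits Java, and every name and bucket below is made up for illustration:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# A recipe step like "split this log line into columns," expressed as
# a tiny Beam pipeline and handed to the Dataflow runner.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',                # assumption: your GCP project
    temp_location='gs://my-bucket/tmp',  # assumption: a staging bucket
)

with beam.Pipeline(options=options) as p:
    (p
     | 'ReadRawLog' >> beam.io.ReadFromText('gs://my-bucket/webchat.log')
     | 'SplitColumns' >> beam.Map(lambda line: line.split(','))
     | 'FormatAsCsv' >> beam.Map(lambda cols: ','.join(cols))
     | 'WriteOutput' >> beam.io.WriteToText('gs://my-bucket/out/webchat'))
```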
Dataflow then does what it usually does with a pipeline like this: it detects the size of the data set and the pace at which it's making progress, and autoscales as necessary. It continues to add resources optimized for the execution of the job. The job produces monitoring metrics that surface to an easy-to-read UI that you can click through to if you so choose--though I can imagine many users are happy to not look at the monitoring UI.
And as the data is being processed--and this is kind of unique to a Data Prep job, not every Dataflow job--we're doing profiling. So as I mentioned at the beginning, you saw a perhaps-representative sample of your data set. You saw histograms, typing, and values, but these were all histograms on the sample.
So one of the things you might be wondering, after you execute this over perhaps petabytes of data, is: was that sample, the 10 megabytes, even representative? And so Data Prep is then going to return to you a profile of your data.
So here's a sample profile. You're seeing the histograms you saw before, and the typing and data validation you saw before. We're also seeing a little bit more information, which can help you, like I said, make the decision of whether this is, in fact, representative, and whether there are any issues. If there are, you can easily jump back into your recipe, take a new sample, or make the necessary corrections.
And from this view,
as you can see,
we can also export
results and view a recipe.
So you're probably
wondering where
this petabyte of
transformation just ended up.
As with the Dataflow integration, Data Prep is integrated with BigQuery and Cloud Storage as the obvious storage solutions. You can read BigQuery tables and Cloud Storage files of the expected common types. And then, of course, you can use these same storage systems as targets.
And there's a lot of
other storage systems
on Google Cloud, and we'll look
to those as well in the future.
And then of course, there's
other non-Google source systems
that we'd love to
support in the future.
But we're finding our users in the private alpha that we've run previously have had no problem keeping themselves busy with BigQuery and Cloud Storage. And for many of our BigQuery users, this is a boon in terms of database migration and schema migration.
So in maybe 15 or 20 minutes we've kind of covered the nuts and bolts of Data Prep.
I want to go into a
demo, as promised.
We'll spend the bulk
of our time here.
Just to kind of give you
some background on the demo,
we mentioned earlier that our motivating example was a line-of-business user who wanted to measure the impact and quality of the web chat experience for their customers.
I often hear this as like, I
want a 360 view of my customer,
as we mentioned earlier.
And I guess what
that means is that I
want to see what the customer
experience is like when they're
learning about the product, when
they're purchasing the product,
when they come back for
support on the product,
and maybe when they're not even
interacting with the product,
elsewhere in their daily lives.
Businesses want to understand
users in this sense,
and normally doing
this is difficult.
But in what we have
here, maybe a half hour,
we're going to see if we can
get a 360 view of our customer
with the help of
Cloud Data Prep.
Let's imagine I'm here in my--I don't need this anymore--my data home page. And we're already seeing a few concepts that may be new to you.
So flows, data sets, and jobs are listed across the top. A flow is kind of like a workspace; this is where I'm going to put a collection of files that I'm working on for a certain project. Data sets are what they sound like, and jobs are executions on Dataflow.
So we're going to create a
workspace for this project.
We'll call it 360 View because
we're not afraid of buzzwords.
And I'm going to
add some data sets.
Now, let's pretend that an engineer friend of ours was able to gather the necessary files that he or she thought would be interesting, and put them in a GCS bucket for us. So I've got here a sample web chat log; some Twitter information--this is kind of in vogue, everyone wants to do social analytics; and a web customer database.
So I don't really know what's in here, as I often don't when starting a project like this. I'm adding them to my flow. And the first thing you'll notice is a couple of helps for your project. You're seeing on your left--yes, your left, my left--the lineage of the data.
In this case, I
have the black data
set, which is the
actual data set,
and then a virtual data set.
And in between
them is my recipe,
or set of transformations.
In this case, there isn't a recipe, right? I haven't done anything. Oh, in fact, there might be. So Data Prep, as it loads a data set, makes some, you know, educated guesses about the first thing you want to do. It tries to delimit, you know, CSVs into columns and that sort of thing. And so you can see in this preview in the sidebar what Data Prep has initially done in that recipe.
So my web chat log looks just like an unreadable log at this point. So not much done there. In the case of the Twitter file, Data Prep's been able to delimit this into some rows.
And similarly for the
customer database.
We're going to start with
this Twitter 360 file.
Oh, boy. I actually knew this was coming, but I don't like it every time I see it. So this is a nested, JSON-like-looking thing, is what I'm guessing. And in fact, the typing up here is telling me this is an object of some kind. It's all in a single column, and my histogram is not helping too much.
So let me bring this out a bit.
Yep, lots of punctuation.
So I can expand one
row to understand
what I'm looking at here.
It looks like I've got
this kind of nested object,
two parts: Twitter
info, user info.
And in fact, if I
look at my histogram,
there are in fact two bars, one
labeled Twitter info and user
info.
If I click on one of these bars--and this is kind of one of the clicks that you can do all throughout the UI--Data Prep makes a suggestion: do I want to create a new column, as you're seeing here below in the suggestions. And you're seeing it's un-nesting. All right, this is great. So I'm going to un-nest these two columns. I'm going to highlight them both, add this to my recipe. And then I'm going to drop this first column.
All right.
So we went from a scary,
single, nested object
to two scary, nested objects.
But I think we're
getting closer.
So I can basically do the same thing as I did before, right? Yep. As I click on a column, I'm getting now the timestamp--and look at that, it even knows that it's a timestamp. And so now the histogram is meaningful. I'm already getting excited. So I'm going to highlight all four of these, and add these to my recipe.
How are we doing? Has anybody unpacked JSON this fast before? If you have, we can talk afterward--I want to do what you're doing. All right. Then I'm going to do the same over here. So I've flattened my JSON file. I'm going to add this to the recipe. Don't mind me, I'm just cleaning up these columns that we don't need anymore. All right. So, you know, in just a couple minutes I'm totally making sense of this. And I don't even know what JSON means, right?
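For contrast, doing that un-nesting by hand might look roughly like this Python sketch--the field names here are invented stand-ins for whatever is in the file:

```python
import json

# One raw row of a hypothetical Twitter file: a nested object with
# two top-level parts, twitter_info and user_info.
raw = ('{"twitter_info": {"created_at": "2017-03-09T08:15:00Z", '
       '"text": "help!"}, "user_info": {"id": 42, "location": ""}}')
record = json.loads(raw)

# Un-nest: promote each nested field to its own top-level column,
# prefixed with its parent key; the original object can then be dropped.
flat = {}
for parent, child in record.items():
    for key, value in child.items():
        flat[parent + '.' + key] = value

print(flat)
# {'twitter_info.created_at': '2017-03-09T08:15:00Z',
#  'twitter_info.text': 'help!', 'user_info.id': 42,
#  'user_info.location': ''}
```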
So I'm looking at this histogram--it's a timestamp, and it actually looks like a fairly representative histogram. Like, I might see a morning peak here, a midday trough, an evening peak. So this is the time-series histogram. The others, if the column's not time-based, are just a Pareto, highest-to-lowest type of thing. And these all look like normal Paretos, kind of the 80/20.
And this is kind of a fairly
common part of the process,
I'm wanting to
explore my data set.
So location's got a
bunch of missing values,
maybe these are optional.
This looks good.
Let's just leave
this here, we've
got other data sets to explore.
I don't quite know what
I want to do here yet.
Oh, but I did want to make a note: I talked about us operating on a sample. So up here at the top we're seeing that we don't know how big this file is, but Data Prep took a 10-megabyte sample of the data set, as we expected.
Perfect.
So up here at the top
is my navigation bar,
I've got my flow, and
then the data set I'm in.
I'm going to switch
over to web chat.
Yeah.
I mean, I didn't really
like the nested JSON
and this file is not really
doing it for me either.
All right.
So I told you that the histograms are a good place to start. Up here at the top, this histogram is telling me nothing other than that there's a bunch of these things.
I'll select that bar.
Turns out these things are these kind of meaningless lines that would probably cause us trouble in the future. The suggestion is to either keep or delete, so I'm choosing the second card here, if you can see that. The color all turned to red as a preview that we're going to delete these rows. I haven't really fixed anything yet, but I'm feeling a little better that I've gotten rid of that problem.
And then I see there's a missing value. Perhaps this was like a header or something, but if there are any empty rows I'll probably want to get rid of them, too. There it is, there's my one row that, again, might cause me problems in the future. Going to delete it the same way.
All right.
I'm sorry to bore you with the cleaning, but it's what we do, and it's tedious, and it's kind of like flossing. Now I'm going to try to make sense of this.
So I've got this--we call this unstructured data, right? There's this big, giant line of text, but you and I can see that there's something in here. You can probably see that there's like a session ID. There's also a source here. And so what I need to do is somehow communicate to Data Prep some of what I'm wanting to do. So I have selected this key; that makes sense. I probably want to extract the source ID, but I don't want to extract just the source ID.
So I'm going to
highlight this key here.
And now Data Prep has an interesting suggestion for me. It says these are key-value pairs, and it offers to extract the key-value pairs, according to the pattern, into a new column. Well, I don't like these nested fields, as I mentioned, but I think we're moving in a direction that I can deal with. So I'm going to add this to my recipe.
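Done by hand, that key-value extraction might look roughly like this Python sketch--the log line and field names are invented for illustration, and the regex is a rough stand-in for the pattern Data Prep infers from the highlight:

```python
import re

# A hypothetical raw web chat log line.
line = ('sessionID=4f2a userID=10 '
        'source=https://example.com/support?question=account-transfer '
        '[08:13:22] Waiting for agent...')

# Pull every key=value pair into columns.
pairs = dict(re.findall(r'(\w+)=(\S+)', line))
print(pairs['sessionID'])  # 4f2a
```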
Now I can work over here. So I went from an unstructured line of text, and now I have a semi-structured object, which is not always comfortable but is actually more usable. And I can do my little trick from our previous file that we've all learned--you people are all experts now. The suggestion is to unpack this structured object. Now I'm through to structured, relational data.
I'm going to go ahead and do some cleanup, as I've done previously. You people all have the luxury of seeing this from a comfortable spot, but I'm up here, prone to making mistakes. So I'm going to clean up.
All right.
Now we've got some structured data. I'm looking at our histograms and they look meaningful, useful. And I'm noting that we're auto-detecting some types here, which is always very exciting as we're using Data Prep--you're moving in the right direction.
I am seeing these two columns. Source just looks like the same URL over and over again--although I appreciate that it's detected that it's a URL. And then this Geo info is still not making sense to me.
Let's take a second. So we've kind of been exploring for a while, but we probably came into this with some inkling of what we wanted to do. We're measuring customer service and web chat support. Maybe my hypothesis is that wait time is going to be a key indicator for me, and I'm not seeing wait time in any of the information I've seen yet.
I am seeing here that these are actual chat logs. Can everyone see this OK? Actually, I can zoom in further. OK, so I'm seeing actual chat logs in here--like, "an agent will be with you shortly."
So if I can grab
these time stamps
I might be able to
approximate wait time.
I'm also seeing that there are kind of two sets of data here: I have these, like, key-value pairs, and then the timestamps. And they seem separated about this point.
Again, I'm going to suggest to Data Prep that maybe we should delimit here. It first thinks it's going to choose all the spaces. I don't want all the spaces.
And so in my card I can--actually, let me highlight one more of these to help it out. So it wants to split into 64 columns. This is the split card, and it actually has other suggestions within the suggestion.
So the second kind of little bump here in the suggestion is to split not into 64 columns, just into two. That's all I want. So I'm going to select this second one, but note it's only splitting on this vague criterion of the space. The third one over also splits into two columns, but it's a little more precise: it splits after five digits and before a bracket. I'm a little more comfortable with this when it runs across an entire data set.
So I'm going to add
that to my recipe.
And as is customary,
drop the column.
OK. Actually, I didn't want to drop that column. Well, this is a good time to review what we've done so far. Mistakes happen--more often with me.
Over here on the
right is our recipe.
So everything, all the
steps I've done so far
have been collected here in
the order I've done them.
I can reorder these--which is a little bit scary, don't do that--and I can delete and edit them. So I'm going to rewind this here. And because I actually, definitely don't want to drop Geo info, I'm going to go ahead and delete that step. I mean, you can see previews along the way, so this isn't too scary of an operation when it's previewed beforehand.
OK.
Let's ignore the Geo info. We've done a lot of key-value pairs today. Let's try and do this wait time thing, if I can get to these timestamps. Can you imagine trying to get these timestamps out with SQL or with Python? You'd have to do some regex, which always involves me going to a browser to read up on regex again.
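For the curious, here's roughly the by-hand version being avoided--a Python sketch against an invented log line:

```python
import re

# A hypothetical chat log line with bracketed timestamps.
line = ('[08:13:22] Waiting for agent... '
        '[08:14:05] An agent will be with you shortly...')

# Grab the first two timestamps: chat initiated, chat assigned.
stamps = re.findall(r'\[(\d{2}:\d{2}:\d{2})\]', line)
initiated, assigned = stamps[0], stamps[1]
print(initiated, assigned)  # 08:13:22 08:14:05
```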
So I'm going to highlight a timestamp. This first one is when, I'd guess, the chat initiated, presumably. And this next one is when it's assigned to a person. So I think the difference between that first timestamp and the second timestamp looks interesting.
So it's ready to extract just one timestamp. It's always helpful to give Data Prep at least a couple of highlights to work with. Now it's ready to extract the first timestamp. That's reasonable, but I want the first two. I'm going to see if asking for additional rows--additional bumps here--is going to help, and it's not helping. So I'm going to go ahead and actually highlight the second timestamp. Forgive me, folks, if I'm moving too slowly for you, but here we are.
We got the first two timestamps.
And we're going to extract them, and you can see them previewed over here. And, in fact, when they get over there you can see that, again, the types are identified: these are timestamps.
Awesome.
Add these to my recipe.
So far, we've shown you how to use all the suggestions. I mean, this has been like point, click--I'm not even doing any work. But here we're going to do a little work.
So there's nothing really to
click on to discover wait time,
right?
I can't just highlight
these things.
So we're going to use the step editor below. Now, there's the step builder and editor. We don't actually throw you into a blank script editor; it comes with all kinds of auto-complete and hints and suggestions.
I'm looking to extract or derive a value from a calculation. And here's a derive suggestion; it seems to match what I want. For the formula, I could browse through some functions here to find the one I'm looking for. There are all kinds of analytical functions supported.
I'm familiar with a
couple of date functions
from my data science days, and
this date diff is basically,
I think, what I need.
I'm going to give
it the first column.
And, oh, you know what?
I could do this.
Yeah, we'll do this.
We'll do this like this.
OK.
The start column is Geo info 3, and the ending column is Geo info 4. I want these in terms of seconds, probably--hopefully my support customers aren't waiting for hours. And you see my preview; you're also seeing some data validation, so if I put in something wrong for seconds, it's not going to work for me, as you see here. I've calculated my wait time.
I'm going to add this to my recipe. And I did that as an example, to show you that you can dive in and modify these steps as necessary.
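Under the hood, that derive step is just a timestamp subtraction. A hedged Python equivalent of the date diff, with assumed column values:

```python
from datetime import datetime

# The two extracted timestamps (hypothetical values).
initiated = datetime.strptime('08:13:22', '%H:%M:%S')
assigned = datetime.strptime('08:14:05', '%H:%M:%S')

# Roughly what the recipe's date diff in seconds computes:
wait_time_seconds = (assigned - initiated).total_seconds()
print(wait_time_seconds)  # 43.0
```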
Now, it was a little confusing for me to refer to these as Geo info 3 and 4, so let's do a quick rename. I'm sorry--well, here's a view I didn't mean to explore, but you might benefit from it. At any point you can look at a column's distribution in detail: the top values of that column, the histograms. Let's back out of here.
I'm going to rename
this up at the top.
OK, thank you.
So this is initiated time.
And we're not going to--
well OK, I'm feeling--
I'm going fast.
And this is assigned time.
Great.
Perfect.
So we've done a lot here. We've got this giant table, and it tells us a lot about our users. And we've also learned along the way--because this is actually just fake data--how to edit a sample with the suggestions. And we've learned how to use the step builder. Now I want to do something a little bit more advanced.
So in this source URL, they all look the same. But I'm noticing there's a URL parameter, question, set to account transfer. And these vary; some of them are account lockout.
So I can switch to the editor here. I'm going to extract from a column--just to give you an idea, you can go ahead and write whatever you want, if you prefer scripting to the mouse. So in my source column I'm going to extract a URL parameter. And because this is typed, it understands what URL parameters are. And we saw in there that the URL parameter of interest was called question.
Once it's fully validated that, in fact, there is a URL parameter in that column called question, I add it to my recipe.
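By hand, the equivalent would be ordinary query-string parsing. A Python sketch with an invented URL:

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical source URL from the web chat log.
url = 'https://example.com/support?question=account-transfer'

# Pull the 'question' parameter out of the query string.
params = parse_qs(urlparse(url).query)
question = params.get('question', [None])[0]
print(question)  # account-transfer
```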
Boom--I now know what question was asked to which rep, and how long they were waiting. And I've done this all in, like--we're still going 20 minutes here, folks. I'm going to keep you for the full hour. We're doing like 10 data sets. We're getting it all done today. I'm going to bed early. OK, dropping this column.
Perfect.
All right.
So we've now done these first two data sets. Let's jump over to this customer database, just to get a view of all of them. This one comes already structured and relational. Again, the first thing I do is look at the histograms. I'm seeing a bunch of missing fields here, and I'm also realizing that this isn't so much a customer database as it really is web analytics.
Last completed step--I guess it could be a customer database. But either way, this is tracking users through a process, perhaps kind of a checkout or purchase process. I'm assuming that we want to get users through all the steps. And there are several values missing in the last completed step.
I'm going to highlight these missing values, and as I scroll through my data set I'm seeing that--I'm actually going to highlight these missing values from status, too. So I see here there's succeeded, incomplete, terminated. So on the last step, people either succeeded, they terminated, or they were incomplete. These empty rows are probably people who went away and never came back. I'll make that assumption.
I'm going to go ahead and modify. So, I keep--let's see--set status. So right now I'm setting status to nothing; I'm going to modify this. I don't think we've done modify before, so this is a good example for you. Instead of modifying to null--and you'll notice here there are also some suggestions; I could modify to NA--I'm going to say "no return visit." And that's in my recipe.
Perfect.
I think we're done here.
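Done by hand, that missing-value step might look like this--a sketch in pandas, with a hypothetical slice of the table:

```python
import pandas as pd

# Hypothetical slice of the customer table, with missing statuses.
df = pd.DataFrame({
    'status': ['succeeded', None, 'terminated', None, 'incomplete'],
})

# The "modify missing values" step: assume a missing status means
# the visitor went away and never came back.
df['status'] = df['status'].fillna('no return visit')
print(df['status'].tolist())
# ['succeeded', 'no return visit', 'terminated',
#  'no return visit', 'incomplete']
```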
So we've cleaned; we've structured, in the case of un-nesting those JSON files. Now I want to enrich these data sets--they're only so good alone. Let me go back to my web chat file here. To start a join, I'm going to select it from the transformation editors.
And I'm going to first join
the web customer database.
I might have looked
earlier for common keys,
but I don't think I need to.
I'm just going to
select join keys.
And immediately, as I mentioned earlier, Data Prep is going to detect and suggest common keys to join. So the session works; I can add additional join keys. Turns out that both customer and user ID join here. And if I were being more careful and I had more time, I'd probably confirm that these are in fact the right keys to join on--but I did this before, and they are. So that's OK.
So I'm going to select
all the columns,
and take out these
duplicate columns.
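For a sense of what the join amounts to, here's a hedged pandas sketch, with invented keys and columns:

```python
import pandas as pd

# Hypothetical stand-ins for the two data sets being joined.
web_chat = pd.DataFrame({
    'session_id': [1, 2], 'user_id': [10, 11],
    'wait_time_seconds': [43, 121],
})
customers = pd.DataFrame({
    'session_id': [1, 2], 'user_id': [10, 11],
    'last_completed_step': ['checkout', 'cart'],
})

# Join on the detected common keys; shared key columns merge
# instead of duplicating.
joined = web_chat.merge(customers, on=['session_id', 'user_id'])
```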
Now I've got this giant, wide table. And if any of you are familiar with some of our data products like BigQuery, you know that a giant, wide table is exactly what you want in BigQuery. You can use this process at home and get your BigQuery tables all set up.
Awesome.
Now I could also--well, let's stop there. I could join in this Twitter data set, but I think I've got enough with the web chat and the customer database joined. And I'm going to go ahead and run this job, because I feel like I've got everything ready now--I've enriched, and I want to do my analysis.
As mentioned, you can use Dataflow jobs within Data Prep to write to either Cloud Storage or to BigQuery. I'm going to select BigQuery here. I happen to have a data set ready for us.
I'm going to create a new table,
and choose a replacement logic.
I've got my job running.
And as Dataflow starts to run this job, I'm going to show you the monitoring interface. I mentioned earlier that you may not want to look at this monitoring interface, but it's kind of fun to look at it at least once--you get a feel for what's happening under the hood.
So here we are. As mentioned, this has been taken from--kind of translated from--your recipe, or script, into Apache Beam Java code. It's now running on Dataflow.
We'll come back to this.
I actually have--let's see--I did this ahead of time.
So this is Data Studio.
As some of you may know--it was announced earlier--Data Studio came originally out of our analytics offerings. It's now part of Cloud Platform and is available in every country as of just recently, perhaps this week.
I've gone ahead and
built a quick dashboard
on top of our output data.
And what you're seeing here is the hypothesis I had from the beginning: is wait time causing bad ratings? So I've got here wait time versus ratings in one scatterplot, and I'm actually not seeing much of a pattern. It looks like wait time doesn't actually affect ratings. Personally, I think it does--but not in this data set.
I've got this other one here, which is how far along they got in the process: was it terminated, succeeded, no return visit, incomplete? We're also seeing that wait time seems to play no role here. But the rating--the customer service rating--made people twice as likely to succeed in the process of checkout, for example.
So we should be
focusing on the rating.
And I'm seeing over
here that my agents,
while they have
various wait times,
they're all kind of in-band.
Nobody's waiting much
longer than anyone else,
but the ratings are
actually quite different.
And I've got these two
low performers down here.
Finally, we are able to
extract that question bit,
and from the questions I can
see that one of these questions
has had a terrible wait time.
In fact, I'm guessing
that maybe there's
a queue for that question.
And we haven't quite mastered
how to do this queue,
or maybe we haven't
fully trained people
on how to respond
to this question.
I'm learning a lot.
This dashboard took me maybe five minutes to put together--you can see how quick it is. You know, I can edit this pretty easily, drag this around. It's free to use; give it a try.
Let's go back to our
Dataflow job here.
You can see how many workers were spun up, some of the autoscaling patterns. This is a very small job, so it would just use one worker. Nothing really interesting to see here.
And I wanted to go back and show you BigQuery, but I didn't have it pulled up. So this is where I'm writing my data sets. You can see the giant schema we created is here and previewed, and all the data we've written.
And I can bump back, I
have another example here.
Excuse me, I went full screen
and now it's giving me trouble.
Let's see.
So here's a list of jobs--I ran a bunch of jobs earlier just so you could see what this job list looks like. I've got one currently running; I've got several completed. I can show you profiled results. And I'm kind of rushing here, because I'm realizing I'm running low on time. But as promised, here's what the job profiles look like coming out of Data Prep.
If you don't mind, switch
me back to my presentation.
I just want to wrap up
with a couple more slides.
We talked through the demo. What you've seen today is Data Prep taking raw data and creating refined data, through the steps we discussed earlier: exploring, cleaning, structuring, enriching. And, again, we did this in a half hour, with some pretty nasty stuff, like that giant log file and the nested JSON.
And then where this gets really interesting is, you saw in that last bit how easily I can pull this into Data Studio: I write to BigQuery, and Data Studio reads from BigQuery.
This is also a powerful and important offering for our Cloud Machine Learning products. We're trying to democratize artificial intelligence and machine learning, and a big part of getting machine learning right is getting your training data clean. And so we have to democratize data preparation as well.
So a lot of you are asking: I want it now. I can see it in your eyes, and if I weren't talking you'd all be screaming at me. A couple of points for you: you can request early access today.
We have some companies
in early access,
and we're looking for more.
That's right.
And even if we don't
extend you access,
we'd love to keep
in touch and we
can give you updates
on the program
as we're approaching
public beta.
You can also try this today downstairs--maybe not at this hour, but definitely tomorrow; that's up to those people. It's in the code labs. We've got one set up for you; pretty excited about it.
A couple other pointers for you: our UX desk is looking to talk to people using our data analytics products. Please stop by.
And I personally would love to
talk to you, send me a tweet
because then I can
analyze all my tweets
and I can analyze you,
and you know I will.
All right, that's it for me.
That's the same as my next steps slide.
Thank you so much for your time.
