[MUSIC PLAYING]
JORDAN TIGANI: Good
afternoon, everybody.
Thanks for having me here.
My name is Jordan Tigani,
Engineering Director
on Google BigQuery.
I was one of the first
engineers on BigQuery
when we sort of were
just still trying
to figure out what to build.
And I've sort of been
riding the BigQuery wave
for the last seven years, and
it's been pretty exciting.
I'll be joined by Lloyd
Tabb, the CTO of Looker.
I'll tell you a little
bit more about Looker
when we get to his part.
But in my opinion,
the best innovations
come from when you
break constraints,
when you have a wall that
seemed insurmountable, so
insurmountable that people
sort of forgot it was there.
And then all of a sudden,
that wall is gone.
And all of a
sudden, then you can
do all of these
creative things with it.
Here's an example.
How many people had a phone
that looked something like this?
Anybody old enough?
Yeah, we have plenty.
So much engineering
energy went into trying
to figure out what to do with
that damn keypad and how to--
all kinds of clever
sliding things,
and they didn't pay any
attention to the screen.
And then somebody,
OK, probably Apple,
came up with a
great idea of what
if we didn't even have a keypad?
And I think those are kind of
the most exciting things where
it's like, all of a sudden
this constraint, the keyboard,
was gone.
And the smartphones really
took off after that.
So what's the big
constraint in databases?
Well, the big constraint in
databases is the table scan.
And you don't have
to take that from me.
You can take that from Oracle,
who's invested more engineering
energy in building vast
databases than probably
anybody.
And they say basically
table scans are a nightmare.
So what if we can
remove that constraint?
What if we can just
make table scans fast?
And so I'm going to start into
a little bit of how databases
work to sort of
build up into how
we can make table scans fast.
So the primary data structure
for databases is a b-tree
or a b+ tree.
And it effectively works by
building a tree of your data.
So this select query that's doing
a filter between two boundaries
on this key--
we're able to fetch that data
doing only five lookups,
because we only have
to read these five blocks.
And I say chances
are the top part
of the tree is
going to be cached,
so it might only be two blocks.
The problem is, what if
you want to run a query
where you can't filter
conveniently on the key?
Well, then you have to
read the whole thing.
And that takes a long time.
So one way that people have
tried to scale databases
is to scale out, basically
taking a database
and running it on multiple nodes
with some lightweight
coordination on top.
I won't mention any
competitors, but this
is what a lot of people do.
And there's a couple
of problems with this.
One is that you're limited
to the storage that's
attached to that node.
So it's hard to scale that up.
And the other one is that
the coordination effort
gets really difficult.
There's another session
running concurrently
where people are talking about
the BigQuery geospatial features
that we're launching today,
and one of the things
you can do there
is a cross join that says,
show me the points from this table
that are near the points
from that table.
In an architecture like
this, that's super difficult
because you have to send all of
this data between these nodes.
So the network bandwidth
becomes a big constraint.
So what did the
folks at Google do?
The Dremel paper came out in 2010,
and Dremel had been around
for several years before then.
But effectively, the way they
approached scaling was we're
just going to scale
everything out
super wide so each one
of these sort of boxes
is a separate machine now,
or a separate task
or a separate container
running in Borg.
And we're going to make
everything a table scan,
because we have our fast
distributed file system,
which is Colossus.
So we can read from hundreds
of thousands of disks.
We can read on 100,000 cores.
And we can make this
table scan fast.
So we've been able to remove
a bunch of database design
constraints.
There's the I/O constraints.
Disks are slow.
If you read from a
small number of disks,
that's going to be painful.
Attached storage, you can only
deal with the storage that's
on that machine.
CPUs aren't getting faster.
Moore's law has sort of
ended for CPU speeds.
And there's also,
as I mentioned,
the network
coordination bottleneck.
So how does BigQuery do it?
BigQuery/Dremel, because
Dremel is the query engine
that powers BigQuery.
Well, it's going to read
from lots and lots of disks
in parallel.
We have the separation
of storage and compute,
so you don't have any
scaling bottlenecks.
As I said, Lloyd and
I were chatting yesterday.
He says, you can't fill it up.
He says that, not me.
I know how many
disks we have, so.
[LAUGHTER]
And we can use thousands
of CPUs in parallel.
And we can use our--
Google's very fast
Jupiter infrastructure.
So the Dremel architecture
when we released it sort of
looks similar to that tree.
This is a diagram from
the original Dremel paper.
One of the things
we soon learned
when we released BigQuery
is that log scanning
is not data warehousing.
If people are going to move
their analysis onto BigQuery,
we have to be good
at data warehousing.
So we made a big bet on
dynamic query planning
and optimization.
And what this means
is we will dynamically
change the query
plan depending on how
that query runs, so that we can
run it in an optimal
distributed fashion.
So 2018, this is sort of
a stylized version of what
the current architecture is.
It's no longer a fixed tree.
Basically, there's
a number of stages.
Each stage goes through
a shuffle, which
is a fast in-memory shuffle.
And finally, you have
the reduce at the end.
And we can dynamically change
how many shards we're using.
We can dynamically change
whether we want to do a
broadcast or a hash join.
We can dynamically change how
many stages, in fact, we're
running.
So, BigQuery as a data warehouse.
I'm going to try to convince
you that BigQuery is
a great data warehouse--
and I think a lot of
other talks at Next
have also talked about
BigQuery as a data warehouse.
And I'm going to go
through some of the myths
and how we've overcome them,
or how those are not actually
true.
So one of the myths is, and
I think that this is even
in some of our
documentation, that
says you should
denormalize your data when
you're using BigQuery.
And if you're in an enterprise
data warehousing situation,
and you want to move
your Teradata,
or whatever your existing
system is, into BigQuery,
you're like, well, I
don't want to change my schema.
I just want my old
queries to run.
So we understand that.
And so one of the things
that we've been working on,
because for enterprise
data warehousing
you need joins to be fast--
this is the TPC-H benchmark
with 10 terabytes,
over the last 18 months.
And so we've increased
performance in the geomean.
So this isn't-- we're not cherry
picking individual queries
to run here.
But we've gotten 5x
faster across the board.
This is TPC-DS, which is
another common data warehousing
benchmark.
That has also increased by 5x.
And I'm just going to show
you that last little bit
at the end, that
little dive down
is a performance improvement
that we just sort of rolled out
this week.
So can we switch over
to the demo, please?
So this is the TPC-DS benchmark,
and this is sort of using
the old system.
I ran this earlier,
and we can see that it
processed 675 gigabytes
and took a minute
and 10 seconds.
I'm not going to make you wait
for a minute and 10 seconds
while that runs.
But I will switch over
to the new version.
The new version is exactly the
same query, TPC-DS query 6.
And we can run this now.
The query is exactly the same;
the only difference
is it's running
against the clustered
version of the tables.
And I'll describe clustering
in a couple of minutes.
There we go.
So 13.3 seconds versus--
so this is a 5x improvement.
And this is just sort
of the kind of thing
that we roll out on a
Thursday, because it's ready.
Can we switch back, please?
[APPLAUSE]
Can we switch back to
the slides, please?
There.
OK, so now sort of we're
advertising our best practices.
Don't denormalize if
you don't need to.
If your schema is more natural
in the denormalized state,
then by all means
denormalize it,
or if the queries are
simpler, or if you're doing
joins against a large
dimension table.
So let's say you
have a star schema,
and you have your
fact table, and then
your other table
is fairly large,
you might consider
denormalizing.
But that's also something
we're working on.
The reason I have
an asterisk by "large"
is because what "large" means
should actually be
increasing over time.
And as I mentioned, joins
have been getting faster.
And they will continue
to get faster.
We've got a bunch of
things on our roadmap
that hopefully we're going to
continue to improve things.
So I was chatting with
Lloyd, my co-presenter
here, a couple of months ago.
And he said to me,
I love BigQuery.
But you guys need to
improve partitioning.
And we're like
yeah, yeah, we know.
He actually didn't say
it as nicely as that.
But this is-- he approved
this version of the slide.
So the version of
partitioning that we've had,
and we've had for
a long time, is
ingestion-based partitioning.
And I kind of just
want to go through how
this works so that I can show
what we're doing differently.
Excuse me.
So BigQuery is a column
store, or has a column store.
So when you run
a select, we only
have to read-- and we only
charge you for-- the columns
that you reference.
So in old BigQuery, we kind
of read down the columns,
but partitioning actually
lets us do a slice across.
So if you have a
partition filter,
we're using this sort of
magical pseudo column that's
based on ingestion time.
But that does let you
select only those bits,
and only have to pay for
reading the data that's
in that sort of intersection.
So it's not just cost,
it's also a
performance improvement.
So just recently, we made
column-based partitioning
generally available.
Column-based
partitioning means you
don't have to use the
special pseudo column.
This means that you can
use a column in your table,
and that's your
partitioning column.
And queries on that will just
sort of naturally filter.
That's great.
That's been in the
process for a while.
That's not really going to move
the needle for the most part.
But today we're
announcing clustering.
And clustering, I think,
is really pretty cool.
And that takes
partitioning sort of way
beyond how it had worked before.
When you have a clustered table,
you have your partition key,
but then you also have
your clustering key.
And what a lot of
people like to use
is like a user ID
or a customer ID.
And BigQuery will actually
store this in sorted format.
S-O-R-T-E-D, not S-O-R-D-I-D.
So when you filter
based on that,
we only have to read a very,
very small amount of data.
So to give you just sort of
an example of how this works,
so first we apply
the partition filter.
Then we apply the
cluster filter.
And we only have to read
a very, very small sliver
of that table.
So queries are going
to be not only faster,
but they're going to
be less expensive.
So clustering works with
high cardinality-- generally
you use it when you have
a high cardinality field.
You only pay for scanning blocks
that match your cluster key.
It's going to be faster.
And it's going to
be easier to manage,
because we have a
lot of people that
are kind of manually sharding
one table per customer.
Now you can put that in one
table and just do a filter.
And it's sort of more
natural data management.
Switch to the demo, please.
So I'm just going to start this.
So last year, maybe
it was two years ago,
at Cloud Next I ran this query
over this giant petabyte table
that we had.
This petabyte table is a
simulated retail schema.
And we can see kind of
what is in this table.
It's got sales, and then you
can join out to products.
And so I'm doing a select
star, which you rarely
do in BigQuery because
it reads all the columns.
And this is doing
a filter by key.
So this is the clustering key.
Now one of the things that
I didn't show last time
was I didn't show how
much this query costs.
Because when I ran
it two years ago,
this was actually
a $5,000 query.
[LAUGHTER]
BigQuery is fast and it
scales, but you pay for that.
So we say, well, you
can scale to this.
And many people who
have this size table
would be using
fixed-price reservations,
so they wouldn't have
to pay per query.
But if you're going to
pay sort of retail prices,
that's how much it would cost.
And when I ran this
two years ago,
it took 270 seconds,
so it was almost five minutes.
So we made some performance
improvement since then.
And usually this takes, well,
we'll see how long this takes.
It's going to take a
couple more seconds.
I'm going to switch to next
one and come back to it.
So that was just a
scan filter query.
And as I promised
you, BigQuery is not
just good at log scanning.
BigQuery is also
good at doing joins.
And so actually last year,
one of the things I did
was show how to do
query optimization.
And one of the optimizations
I showed was, OK,
if your filter is on the
wrong side of the join key,
or the join constraint,
then your query
can be really slow.
So this is just showing that,
not only have we fixed that
by being able to push the
constraints through the query,
but also-- let's see,
this query over the
petabyte table took 3.5 seconds.
We're actually doing a real
join against a real table.
All right now let's jump back
to this query, which finished in
just under two minutes.
So we're about twice as fast
as we were two years ago.
Things get faster.
But one of the cool
things is if you
look at how much data we
process, it's half a gigabyte.
So this is what you pay for.
So we've gone from
$5,000 to this
is about a tenth of a penny.
So this is, what, like
four orders of magnitude,
five orders of magnitude
price improvement.
[APPLAUSE]
And back to the slides, please.
So in case you're wondering
when do you use partitioning
or when do you use clustering?
So they are different.
If you have a
database background,
people tend to be
familiar with these terms.
If not, it may be a
little bit confusing.
Partitioning is generally
when you have a small number
of distinct values.
Less than 10,000 is our maximum
for partitioning right now.
Clustering can be
any cardinality.
You can have millions or
billions of different values.
In BigQuery, the dry run pricing
for partitioning is exact.
So before you run
the query, we'll
tell you how much
it's going to cost.
For cluster tables,
we're going to tell you
the maximum it could cost.
And if you have a clustering
filter, it could be a lot less.
That's sort of what you give up.
And so the query pricing
is exact for partitioning.
It's best effort for clustering.
So if you're writing
a lot to your table,
we may not have
sorted it completely.
And so we're going to
charge based on whatever
we have to end up reading.
There is a small
performance overhead
for partitioned tables,
per partition that you're scanning,
and clustering doesn't
suffer from that.
But on the other hand,
with partitioning
you can do data management
like each partition is a table.
You can have expiration times
on your partitions,
whereas with clustering
you have to use DML.
OK so I want to address some
other myths about BigQuery.
So one thing people
say is well, BigQuery
can't do row and
column level security.
So we do it in a
slightly different way.
You can use authorized views.
And so this has been
available for a long time,
but I kind of want to
describe it because I
think it's underutilized.
So first up, you can
create a view of the data
that you want to share.
And that could filter
out the rows and columns
you don't want to use, or you
don't want somebody to see.
And then you add that view to
the ACL of the source dataset.
Now what this does is
it lets the person who
owns that dataset see
exactly who has access.
So we're not doing kind
of delegated authorization
or anything, and you
don't have to worry about
if the person who created
that then leaves the company.
You can audit.
You can actually see
what's on that ACL.
So just to show kind of how
this works on a typical table,
let's say you have a
phone number which has PII
and you don't want
people to see it.
And maybe you have some
sort of like special rows
that you also want
to keep private.
So you just create
a view definition
that, let's say, filters out
the rows you don't want to show
and hashes the phone number.
And by default,
you can only see
this view if you
have access to the underlying data.
So the next thing you do is
you add the view to the dataset.
You say, allow this
view access to this table.
And then you can share
access to that view.
So one of the things we like
to talk about in BigQuery
is just how easy it
is to share your data.
And we see this as a positive.
Lloyd will also describe
a little bit about that.
But I was talking to somebody
at a briefing in Texas,
and he's like, well, easy
data sharing, that sounds
like a bug and not a feature.
That means that
somebody can share it
with somebody that I don't
want it to be shared with.
So VPC Service Controls
is something that's in beta.
And that's going to
allow you to prevent that--
and we could have
implemented something
like this just for BigQuery,
but this has actually
been implemented across
Google Cloud using
standardized mechanisms.
So you'll be able
to prevent anybody
from sharing your BigQuery data
outside of your organization.
You'll be able to define
which services can be used,
like so maybe you
can't export it to GCS.
It allows you to control
the routes of data exfiltration--
you can cut it off completely,
or cut it off except to certain
people or certain buckets.
And you can also say
that this can only
be accessed from your
organization's IP range.
And with the VPC,
you can have stuff in your
Compute Engine instances
that looks like it's coming from
your organization's IP range,
so those will
just magically work,
whereas some random hacker is
not going to be able to get access.
Another thing that
people tend to want
is they want to be able to
control their own encryption
keys.
So BigQuery has a feature,
generally available
since February,
that allows you to manage
your own encryption keys.
And so this sort of reduces
the debuggability
from the BigQuery side.
Like we can't go
in and figure out
what went wrong with
BigQuery as easily
without getting
explicit approval.
And also, it allows
you to basically
just be in control of how
encryption is being applied.
So this is sort of how
the encryption works.
And the final decryption
actually happens all the way
down at the storage layer.
The key material
never flows through
the BigQuery servers,
nor does the encryption or
decryption happen there.
So with GDPR and with
other requirements,
people want to have
clean ways to manage
their personally identifiable
information, their PII data.
And so I want to
share a design pattern
that one of our customers uses,
which uses encryption
to control who has access to
PII, and to actually remove it--
for when somebody
wants to be forgotten.
So let's say you have
a table with PII.
This is sort of the basic table.
The phone number column is PII.
So you can encrypt
that with a key.
You can encrypt that with a
key per row or per user ID.
You have a separate table
that has the keys that
match up with each user ID.
So when you read that, you
can run an AES decryption,
using the new functions that
we have to decrypt the data.
And the nice thing about this
is you can share this table
with anybody you want.
They can copy that table.
You don't have to
necessarily chase it down,
but they can't actually
read the PII in it
unless they have access to
this underlying keys table.
And then, for the
right to be forgotten,
or if you want to get
rid of that user,
crypto-delete is as
simple as doing a DML
delete from that keys table.
So you don't have to go back
and figure out all the places
where that PII went.
It just can now no
longer be accessed.
Another myth is that people
are worried that BigQuery makes
it easy to run up a big bill.
We now have self-service
cost controls.
I have one more
demo, if you don't
mind switching over, please.
Thank you.
And so I just want to
show how this works.
This is the Google
Cloud console.
You can get here from--
where'd it go?
IAM & Admin, then Quotas.
Filter by the BigQuery API, and
you can change the allowable--
if I can click on this, no.
You can change the
allowable usage per day.
I can say edit quotas.
Sorry, this is Lloyd's laptop.
He doesn't have permission--
which I guess is a good thing.
We don't need him to
change our quotas.
But had we had permission, we
could just say edit quotas here
for the BigQuery API,
the query usage per day.
And that gets
applied immediately.
So that's across your project.
There's also a per-user
one that you can set.
So if you want to say
each user should only
be able to process 10 terabytes
a day, you can set that.
You can set that here.
That one doesn't get
applied immediately.
There's a little bit of a delay.
But at least there's
sort of self-service ways
that you can manage that.
Back to the slides, please.
Something else that
is sort of a myth
is that BigQuery doesn't
allow enough DML updates.
We do have batched
DML operations.
So while people are often used
to doing individual deletes,
here you can delete
based on a filter,
or you can delete based on
a join from another table.
But there's another
really cool thing
that's in beta right now
is the SQL MERGE statement.
And I wasn't aware of
the SQL MERGE statement.
It's actually a
pretty slick thing.
So I want to go through how
this works a little bit.
So let's say we have
a customers table,
and let's say we have another
table that has our updates.
And we want to apply
those updates
to the customers table.
So the first thing we do
is we start our MERGE statement.
We say we're going to merge,
and we're going to match
those up using the user ID.
And you can use complex
conditions here.
So then we match those.
You can define
match conditions.
So if there is a match
between the user ID on one side
and the user ID on the other side--
and you can make those
conditions arbitrarily complex,
you can say, also only if the row
on the left is not locked--
then we can do something,
and the thing we want to do
is update it to be the
thing that's on the right.
So, fancy animation.
We can see we've updated that row.
But you can do more
with a single statement.
You can also say, well,
when there's not a match--
which means there's something
in the right table that's
not in the left table--
in this case we'll say,
well, we're going to insert that
into the left-hand table.
But wait, there's more.
We can even say
WHEN NOT MATCHED BY SOURCE,
which means if something
is in the left table
but not in the right table,
and we can apply complex
conditions there as well.
We can just say, in this case,
we're just saying delete.
But this just allows you to
do sort of really complex,
like read, insert,
update, deletes
all in a single statement.
All right, so I've
tried to convince you
that BigQuery is a great
place to do data warehousing.
But I think one of
the things I wanted
to show you-- the reason
I had that initial discussion
about removing constraints--
was that I think BigQuery
can be more than just
a data warehouse.
And of course I'm biased.
I've been working on
BigQuery for seven years.
And Google pays me to say
nice things about BigQuery.
So Lloyd's going to come up and
talk about sort of beyond just
normal data warehousing,
and the kinds of things
that he's been able to do.
So Lloyd is the
inventor of Looker.
He is a big kind of
BigQuery aficionado.
He's been able to push
BigQuery really further
than almost anybody
that I've seen.
So I think there's
not anybody better
to show what else you
can do with BigQuery.
Thanks, Lloyd.
LLOYD TABB: Thanks, Jordan.
[APPLAUSE]
So as Jordan says,
I'm a bit of a geek.
So I'm not much of a speaker,
so I apologize in advance.
I'm going to frame what you
can do with BigQuery first.
I've been trying
to think about how
to explain this for a long time,
since actually starting Looker.
And I've still never gotten
it quite right.
It's never been perfect.
So this is a new
way of describing it
that I'm going to
try on you guys
and see if it makes any sense.
In the traditional
reporting world,
I think of that world as
2D data and 2D information.
And those-- that answers
questions about what happened.
So a traditional 2D thing might
be something like the New York
Times.
The New York Times
is vetted every day.
It's a digest of most
of the things
that happened that day.
You know that the
information is accurate.
You can bank on the
veracity of the information.
And it's important.
A textbook is another
example of a 2D report.
And for the most
part in enterprise,
2D reporting is what you
run your business on.
The SEC is very careful--
if you build a report and
it goes to the SEC,
you very much care about
the accuracy of that.
There's another
kind of information
though that's happened since
the dawn of the internet.
And I was lucky enough to
be around at the beginning.
And that experience,
that 3D information
is something like, say, Google,
where you can go ask a question
and do some research
about a subject.
So the 2D information, say--
here's an example.
Suppose you're reading
The New York Times
and you see a new competitor.
And you read about the company.
Well, if you're interested in
that, you know that it exists,
but there's not a lot more
that you can do with it,
with that piece of information.
And pre-internet, you'd
run down to the library
and look for other
periodicals about it,
or call people
and see if you could
find some information about it.
But in the post-internet world,
you go to Google and you type
in the company name, and
you look at LinkedIn,
and you see how many
employees they have.
And you might look
on Crunchbase
and see who the
investors are and who
the principals are.
And you might look on
Glassdoor to see whether or not
the employees like working
there and what's going on.
So there are other
ways of doing research
about something to
get a better picture,
to get more perspective.
So 2D is flat and
accurate and perfect.
And super important,
because you can't
run your business without that.
And most of the time,
we think of data,
we think of it that way.
But there's another way
of thinking about data,
which is the investigative way.
And it really hasn't been
possible before--
these new tools like BigQuery
allow that to happen.
So what's important
in that is that you
have a pretty comprehensive
source of information,
so like the web.
There's lots and lots of ways
to look at it, and to dig at it.
And because of these
new data warehouses
like BigQuery, you can put
all the data in one place,
and then BigQuery and Looker
can actually do the research to
find out information about it.
And so the 2D information
answers what happened.
And the 3D is diagnostic--
it's about why it happened, right?
And there have been ways
to extract small data sets
and try to look at
why something happened.
So people have been interested
in that for a while.
But really to get
a holistic picture,
you need the whole set.
So that's what I think that you
should be thinking about trying
to build out of BigQuery.
And so I'm going to give you an
example of 3D data experience.
One of our early customers
was a hotel aggregator.
And there was a
marketing person.
And they had a 2D report,
which was they were
looking at their user base.
And they were trying to
figure out referrals,
because cost of acquisition
for this business
was super important, right?
In order to be able
to drive business,
they had to drive the cost
of customer acquisition down,
so that they could spend more
on the platform over time.
So they had a little report.
It was just a grid.
And on the one corner,
it was how many bookings
they had going across
and how many referrals
they had going down.
And in the lower left-hand
corner, there was a number.
And it was big.
So these were people
who had never ordered
but referred a lot.
They had people who
had ordered a lot
and referred a
lot, that was that.
But what surprised
them was the number
of people who had never ordered
but referred a lot.
And they were like,
were we getting gamed?
What's going on?
In a 2D world,
you're kind of done.
That's it.
I got to go back to the
data team to get it.
But in a 3D world,
they can click
on it, which isolates the data.
Basically, it would isolate it
to both of those constraints:
never ordered, and
referred a lot.
And they could then
look at the detail.
So it's isolation and
drilling into the detail.
And then they got a
list of email addresses.
And they said, well--
they weren't engineers like me,
they were marketers,
so they were not
afraid of actually emailing
them and saying, who are you
and why are you doing this?
[LAUGHTER]
And they got answers.
It came back:
I'm a taxi driver,
I'm an airline stewardess,
I'm a bartender,
I'm someone who gets
asked the question
where should I stay
tonight, right?
And that changed
their business, right?
That ability to actually walk
around and drill into the data,
actually going from
the 2D perspective
through the 3D perspective gave
them a chance to see more data,
to understand why
that was happening.
I'll give you another example.
So Looker-- I've been working
on Looker for six years.
But before that, I was a CTO
at a bunch of companies.
And one of the
companies I was at,
I was the CTO of an
online crowdsourced call
center with 30,000
home-based operators.
And my job was to make it so
that everybody in the company
could see data.
So I built some
special purpose tools.
And we have this measure
that we use called a chup.
Who knows what a chup is?
A chup is a phone call where
the caller hung up before it
came to an operator, right?
And so an operational
measure that we
picked was the number
of chups over the total
number of phone calls.
What percentage of
calls were chups?
And then we would look
at this over time.
And in the middle of the
day, we saw a big spike
that said there was a bunch
of chups in an hour period.
So we isolated it.
So we were like, well,
why are there chups?
And so the 2D report
said there was a problem,
but we wanted to
do some analysis.
So the way that you
start to do analysis
is you isolate it to that.
You filter on the
range that's there.
And you start to dimensionalize
it by different things.
You can dimensionalize
it by operator.
Well, was there one operator
who wasn't picking up?
Were we routing to them
and they weren't picking up?
No, that wasn't the problem.
Was it one program-- did we have
a whole bunch of phone calls
for a particular program that
we were answering calls for?
No, that wasn't the problem.
Well, let's isolate it
by telephony switch.
Oh look, this switch
is broken, right?
And so by
dimensionalizing the data
and walking around the data,
you can actually see it.
And so these are
the kinds of things
you can build out
of a modern data
warehouse and a data platform
like Looker that weren't
possible before.
So traditional
data warehousing is
all about the accurate
report, right?
It's all about getting
the data right.
So in order to do
this, what you do
is you vet the
data on the way in.
The traditional way
of putting data
into a data warehouse
is that it's relatively small.
The data warehouse is expensive.
Storage is expensive.
So what you do is you take
your transactional systems,
and you vet it, and
you compress it,
and you put it
into the database,
and you know that the
data is accurate once it
lands in the warehouse, right?
The whole idea is get
it in there, it's right.
So the data team is
working as the ETL team.
They're the people who make
sure that the data lands
there correctly.
In the new data
warehousing model,
you land the data
in the warehouse,
and then you build
models on top of that,
and you write code on top
of that, so that you
can draw more together
and have access
to all of the detail.
We say that you
build a model against that data
so that you know
that it's accurate.
And when you pull the data out,
it comes through code,
as opposed to being
vetted on the way in.
And so we've been
working together
with BigQuery for
quite some time.
And together we make
this great platform
that you can use to do this
3D experience and data.
They invited me up here because
I do wild things with BigQuery.
I'm going to show you a
couple of cool things.
One thing about BigQuery--
so, Looker works against almost
every kind of database
there is.
But I really love BigQuery's
SQL, because it is amazing.
It's amazing how
they've implemented it.
And what's different about them
is their arrays, structures,
the UDFs, and the
nesting, as well as
the ability to handle just
astronomical amounts of data.
But the structures and
the UDFs and the nesting
allow you to build stuff
that you can't possibly
build in a traditional
data warehouse.
And it all comes down to
one piece of technology:
arrays of structs are really
sub-tables on a
single row of data.
Has anybody here played with the
arrays and structs in BigQuery?
Oh, cool.
And UDF?
Cool.
It's super powerful.
So actually, I'm going to
do a demo in a little bit.
But everything comes
down to the fact that--
I'll keep repeating.
This slide is going to
appear a couple of times.
It's really important to
recognize that in a table,
a repeated record
is really just,
in BigQuery, a structure,
an array of structs,
and you can treat that
as if it were a table.
There are basically three
kinds of UDFs in BigQuery.
The obvious one
is the scalar UDF.
You can pass a string,
and parse something out
of a string in a UDF.
And you can do
that in JavaScript,
or you can do that in SQL.
But what's not
necessarily obvious
is that you can aggregate
an array into a structure
and pass it into a UDF and
treat it like it's a table.
And what that really
allows you to do is fork queries--
you can take a stream of
data, fork some of it out,
do some really
interesting processing,
and bring back the results.
So all the aggregate
UDFs follow this pattern.
The ARRAY_AGG(STRUCT(...))
construct is really
building up a sub-table.
That creates an
array of structs.
Then you pass that
through a UDF,
and the UDF does a
select statement on it
and passes the result back.
Here's an example of a median
function written in SQL.
You'll notice that the caller
does an ARRAY_AGG on the numbers
it's going to collect,
and then an UNNEST of that array
inside the function turns it
back into a table
in a select statement.
Does that make sense?
And I'm going to do a demo
with some interesting UDFs
that I've built. I'm starting
with a table that really
consists of just five fields.
It's a name, a gender,
a state, a year,
and the number of
people that were
born with that name, that gender
in that state in that year, OK?
So this is from the Social
Security Administration.
The data goes from 1910 to 2013.
And it's pretty wild what you
can find with just these five
fields.
So can we switch
over to the demo?
So this is Looker.
And in the IDE I've
built a little model.
So for each of the dimensions
that are in the table,
I've created a LookML dimension
for the name, state, gender,
and year, and the number was an
integer that I wanted as a float,
so I cast it.
other measures, a decade, which
is just divided by
10 for the year.
I built a sum
aggregate and a count.
And let's go explore it.
So this is the Looker interface,
the Looker explorer interface.
It's pretty simple.
I can actually--
I can start with the total
number, and the name count,
and I can run that.
And I can just see how
much data there is.
So there are 300,000 names.
29,000 different names, right?
I could add year, and
look at it by year.
Let's sort the year.
And we go to 1930.
Let's see.
So down in-- then
I can see that this
shows to the most
popular names in 1930.
So it's just building
a little data model.
What Looker is doing is
writing queries against the model
and feeding them
to BigQuery as SQL.
So everything that we click on
really ends up being
just a SQL query.
Now, up above are the UDFs
that I built for this demo.
But basically,
Looker just writes
SQL against BigQuery
and returns the results.
So you get access
to all the data
no matter what
you're looking at.
So what we-- let's see.
So if I want to look at
the most popular name,
I can look at the name
and then total number,
and I can find the
most popular name.
It's James, John, Robert,
Michael, William, great.
That's all terrific.
I can add filters.
Let's look at female names.
And let's look--
let's do median year.
So Mary, the median
year for Mary is 1941.
The median year for
Patricia is 1951.
The median year for
Elizabeth is 1966.
Now in this data, it's
kind of weird, right?
I just have a state
and the number of people
that were born with that
name in that year.
So the median calculation is not
simply the simple median.
It's a weighted median.
The median year is the year where
as many people were born
before as after, right?
And I wanted to
build an aggregate
function that does that.
So I built a weighted
median function.
And the median year measure
is that weighted median--
it passes the year and
the number of people
that were born in that year.
And I've got a new
aggregate that's in SQL.
And so if I actually go
look at the code here,
it really is quite simple.
I use that weighted
median function
as if it were a built in
aggregate function, which
is really kind of cool.
It's actually taking
something pretty complicated
and building me this new
concept of weighted median
that I can use anywhere.
And let's take a look at how
that function is created.
So I built this,
and it's all open--
basically, all of
this is open source,
so you can grab it and
play with it if you want.
So there's the weighted
median function.
Here's basically what it is.
It takes that array as
input, does this calculation,
and returns the results.
Pretty cool, right?
But you can also do other
kinds of wild things.
Let's actually filter
this by state,
and look at Alabama--
oh, come on--
versus California.
Let's pivot the state.
And let's look at
a graph of that.
Now again, all of this is
running as a single SQL query.
OK, so we can see
that Mary in Alabama,
the median year for Mary in
Alabama was 1938, not 1951.
And we can see a little
graph of what it was.
And it is kind of
interesting, because it
was really popular
in 1931 in Alabama,
and it peaked in
1984 in California.
So like maybe it was
Annie Hall or something?
I don't know.
Like it's pretty wild what
you can see in the data.
Again, this is just one
table with five columns in it
and BigQuery on
top and some UDF.
It's pretty cool, right?
[APPLAUSE]
I think it's cool.
[CHUCKLES]
And if you look
at the SQL for it,
if we look at the
definition for this,
we can go back and look at the--
sorry.
We'll look back
at the SQL here.
There's all the stuff that
is doing the work,
but really it's just
a time graph UDF.
And the time graph
takes the data as input
and returns an array
of integers, which
are the amounts for each decade.
And then it converts
that to an image, right?
So the SQL is just returning
an array of integers, right?
And so that's pretty
snazzy, right?
Here's another interesting
function that--
how am I doing in time?
Eh.
Actually, I've got
another demo but I'm
going to skip that because I
want to keep rolling forward.
Can we go back to the slides?
So nesting is the
other thing that you
can do with BigQuery
that's pretty amazing.
So the UDFs are really
cool, because you
can build these aggregates
and do all kinds of wild stuff
with it.
But nesting is the
other thing that
is pretty unusual in
BigQuery in the scale.
So very often in big databases,
joining two large tables
can be a huge problem, right?
Sorry.
There are things that are
really hard to do when you have
a couple of large entities.
So like figuring out
co-occurrence in large tables
can be very expensive,
navigating graphs of data
can be very expensive.
And nesting actually
makes a lot of this stuff
better and easier.
So again, if you think
about it, in a nested table,
basically what you have
is a repeated record,
which is essentially
an array of structs
stored in the row.
Here's an example of what
a nested table looks like.
This is an orders table
with order items.
If you've got an e-commerce
store that has orders,
and you can order
more than one thing,
then in a traditional SQL database
you'll naturally have
an orders table and
a line items table--
an orders table and an
order items table.
But in a nested situation,
you can put those together,
because they don't work
without one another, right?
An order item really doesn't
exist without an order
to place it in.
And as we can see
here, the structure
here is an ID, product name,
quantity, and unit price.
That's an order item structure.
And then it's repeated
within each order.
So each of those is
like a little sub
table within the order.
Why do you care about this?
Well, first of all, efficiency.
And second of all,
computability.
If you're trying to figure
out what was bought together
in a really large database--
that's called co-occurrence--
you'd normally have a really
hard time figuring that out.
With nesting, you basically
get the join for free.
So any time you have
a nested record--
and it can be up to 100
megabytes in BigQuery--
you can pass that off to a UDF,
or you can use another
SQL query to process
against it or join against it.
And that happens in a way
that is super efficient.
So you'll be really surprised
how fast things work.
Nested data that
you should look at
are things like orders
and order items.
Sessions and events
are really hard.
So one of the traditional
things in SQL, with log data,
is that you sessionize your data.
You've got your event data,
you roll it up into a session.
It's a very traditional
MapReduce problem.
And then you're trying to figure
out who logged in and bought.
Who signed up today and bought?
That can be really tricky in
a traditional data warehouse,
but it's very cheap to
do in a nested situation,
because you can join the data
twice and find co-occurrence.
Other places are users
and their attributes.
What you don't want to nest is
something that's not naturally
part of it.
So like the user is not really
part of an order or an order
item, because the
user might appear
in multiple different orders.
So you wouldn't nest users.
Does that make sense?
When you join a nested
table, it works
just like joining
another table.
It looks complicated, but
it's actually really simple.
Let's take a look.
In this case, I've--
hold on just a second, sorry.
I'm missing a slide.
How did that happen?
I'm sorry.
This is what an
unnest looks like.
I don't know what happened
to the slide,
so I'm just going to skip ahead.
Sorry.
Anyway, co-occurrence,
graph navigations,
super efficient summary
tables, and they
can be larger than a single row.
Oh, there's the slide.
OK, this is--
[LAUGHTER]
I'm sorry.
I told you I'm not
a great presenter.
So anyway, so this is basically
what a normal join looks like.
And this is what an
unnest join looks like.
They're pretty much identical.
We're out of time?
Looks like I'm
running out of time.
So Jordan, I'm going to call it.
Sorry.
[APPLAUSE]
JORDAN TIGANI: Can I just--
LLOYD TABB: Yeah.
[APPLAUSE]
I just want to put in a plug.
We have a bunch more
BigQuery sessions
that go into more detail
about some of the other cool,
beyond-data-warehousing
things, like BQML,
which I'm sure you've
heard lots about today,
geospatial, et cetera.
And the public datasets
program as well tomorrow.
I also want to say there's
a bunch of folks from
the BigQuery team here.
If you see one over
at the BigQuery
booth in Moscone South,
please give us your feedback.
Tell us what you like, what
you don't like, problems
you're having, et cetera.
Finally, I just
want to say BigQuery
is even better with Looker.
I want to thank Lloyd
for being up here with me
and being able to show what
kind of stuff BigQuery can do.
Thank you.
[APPLAUSE]
[MUSIC PLAYING]
