JEFF DAVIS: Welcome to the
Google Cloud Platform education
series.
My name is Jeff Davis,
and in this session
we'll discuss analyzing
your data in less time using
Google BigQuery.
We'll be running a moderated
Q&A during the session,
so feel free to use the
Ask your question field
at the bottom of your
screen to submit questions
as we go along.
Let's get started.
So the agenda for this
session will be starting out
with a series of demos.
We'll show you typical
use cases and the speed
at which BigQuery operates
and how easy it is to use.
Then we'll talk about the
general set of features
for BigQuery and how you
might be able to leverage them
in your environment.
We'll talk about how to find
answers to questions after you
leave today's
session and you get
started working with
BigQuery on your own.
We'll talk about some best
practices for evaluating
BigQuery in your environment.
And again, we'll be
doing Q&A as we go along,
so submit those questions.
Let's start with
a series of demos
that show both BigQuery's
scale and how easy it is
to integrate with other
systems that you use for data
analysis and reporting.
What we have here
is the web UI that
allows end users to easily
play with data within BigQuery.
We're going to show you
some integrations later
on, but right now this is
just for doing quick iterative
analysis of large data sets.
So let's start with
some sample data
sets that Google
makes available.
So for instance, here we
have airline on time data.
So we have a list of flights.
And if we take a
look at this table,
you can see it's
about an 8 GB file.
It's 70 million rows of
data regarding different
airlines' on time, arrival,
and departure performance
in different states
and so forth.
And I want to get
some information.
So I'm going to start with a canned
query that I've got here, OK?
And this is going to find
all the arrival entries just
for United Airlines.
So I'm going to copy this,
go back, and query the table.
OK, you can see I'm
using standard SQL.
I'm going to be querying
the flights table,
looking for all rows where the
airline code is 19977, that's
the code for United Airlines.
And I'm going to select
three columns from every row.
Now I'm going to disable
caching because we
want to see how
long it takes to do
an initial query against
the 70 odd million rows.
You'll notice it gives me
an estimate of how much data
it's going to have to process.
This is also useful
for estimating
how much it's going to cost.
You pay, I think, $5 per
terabyte of data processed
after the first terabyte.
So this uses some
fraction of that.
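For a back-of-the-envelope feel, here is a small sketch of that pricing math in Python. The $5-per-terabyte rate and the free terabyte per month are the figures quoted in this session; the helper name is illustrative, and you should check current pricing before relying on it.

```python
def query_cost_usd(bytes_processed, free_tb_remaining=1.0, price_per_tb=5.0):
    """Estimate BigQuery on-demand query cost.

    Assumes the $5/TB on-demand rate and 1 free TB/month quoted in
    this session -- verify against the current pricing page.
    """
    tb = bytes_processed / 1024 ** 4               # bytes -> tebibytes
    billable_tb = max(0.0, tb - free_tb_remaining)  # free tier first
    return round(billable_tb * price_per_tb, 2)

# The 100-billion-row Wikipedia query later in this demo processes
# about 3.64 TB; with the free tier already used up:
print(query_cost_usd(3.64 * 1024 ** 4, free_tb_remaining=0.0))  # 18.2
```

Which is roughly the twenty dollars mentioned later for the 100-billion-row query.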
So I'm going to run
this and it should
take between 25 to 30 seconds
to complete the query.
When it comes back, it
will show us the three
columns for all the matching rows.
It will give us a row count.
And then we'll move on to a
slightly more interesting query
that gives us some more
insight into our data.
So you'll notice it
took 23.8 seconds.
Processed 2 gigabytes of data.
It returned 5,043,578 rows.
So it's a lot of data returned.
That's part of why it
took so long to come back.
It would have been
faster if there had been
fewer rows in the result set.
So what this does is it just
gives us the date, the arrival
state, and the arrival delay.
I would like to get a
little bit more insight,
so what I'm going to
do is run this query.
This query should tell me
the average delay by month
by arrival state
for United Airlines.
So I want to see what
months are the worst
and what states are the worst.
So let's run this.
Again, it's going to process
the same amount of data.
That came back
quite a bit faster.
And you'll notice that we
have 5,000 rows instead
of 5 million rows returned.
That's part of why it
came back so quickly.
And you'll notice
that what we've got
is for each state we've
got individual months.
And for those months, we've got
the average delay, the number
of flights, the number of
flights that were early,
the number of flights
that were late,
the percentage of
flights that were early,
and the percentage of
flights that were late.
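To make that aggregation concrete, here is a rough pure-Python stand-in for what the SQL computes. The tuple layout and field names are illustrative, not the flights table's real schema.

```python
from collections import defaultdict

def delay_summary(rows):
    """Group rows by (state, month) and compute average delay,
    flight counts, and early/late percentages -- a toy version
    of the GROUP BY query shown in the demo."""
    groups = defaultdict(list)
    for state, month, delay in rows:
        groups[(state, month)].append(delay)
    summary = {}
    for key, delays in groups.items():
        n = len(delays)
        early = sum(1 for d in delays if d <= 0)  # on time or early
        summary[key] = {
            "avg_delay": sum(delays) / n,
            "flights": n,
            "early": early,
            "late": n - early,
            "pct_early": 100.0 * early / n,
            "pct_late": 100.0 * (n - early) / n,
        }
    return summary

# Made-up sample: (arrival_state, month, arrival_delay_minutes)
sample = [("CA", 1, 12), ("CA", 1, -4), ("CA", 1, 30), ("TX", 2, -10)]
print(delay_summary(sample)[("CA", 1)])
```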
So this is using the lingua
franca of data analysis.
It's using standard SQL.
It's not that
difficult to write.
There's no actual query
optimization here.
You'll notice I'm
also using subqueries.
That's perfectly valid.
And I was able to
very quickly get back
insights from a very
large collection of data.
So again, you can do this all
interactively through the web
UI.
Let's actually
look at just scale.
How fast can this go on
really big data sets?
So you think, well,
Jeff, 75 million rows.
That's a pretty big data set.
Well, it's not as big as some of
these data sets, for instance.
Here we have Wikipedia
benchmark data.
And if you look, you can
see that the table basically
has got year, month, day,
project, language, title,
and view.
So this has got data
about how many times
particular articles were viewed.
So we've got a
million row version.
We have a 10 million
row version, 100 million
row version, a
billion row version.
Let's see how this all scales.
So we'll start with a million.
I'm going to turn off the
standard SQL and use legacy
SQL just for giggles.
So we're going to query
a million articles.
We're looking for all
articles that have
Davis because I'm fairly vain.
I'm going to look for
articles about people
with my same last name.
We're going to use a regular
expression to match these,
and what I want to do is
actually group them by title
and sum the number of views.
So let's do this over
a million row data set.
This is 47 megabytes.
And it should come
back fairly quickly.
So you'll notice Anthony
Davis and Thomas Davis
are the most frequently viewed
articles in the 1 million
entry data set.
Let's look at 10 million.
OK, so we go from 47
megabytes to 476 megabytes.
And we go from just a couple
of seconds to three seconds.
We now have roughly
2,500 results.
Daniel Davis is the top now.
So this was three
seconds to do 10 million.
Let's do 100 million.
And you see where
we're going with this.
You want to know if I'm going to
get to the 100 billion number,
and I am.
I'm going to have to pay $20 out
of my pocket to run that query,
but I'll do it just for you.
So you notice, ah,
Jefferson Davis.
Not my namesake.
It's a coincidence.
Favorite jazz musician,
Miles Davis, is up there.
So we've done 100 million.
Let's do, oh, what the heck.
That took two seconds.
Let's do 100 billion.
This is going to process
3.64 terabytes of data,
100 billion rows.
Cross our fingers.
Get out my ATM card.
And yet, think about it.
Twenty dollars to run this
query versus the amortized cost
of a query against a
big Teradata system,
or some other big data
rig that you've got.
It's going to give you
equivalent performance.
It's actually a heck of a value.
So I think it's about 30 to 50
seconds to complete the query.
While that's running, I'm going
to get-- oh, actually it's
done.
So 37 seconds.
Notice it came back with
185,000 rows in the results
and it has aggregated and told
me that Miles Davis is in fact
the most popular article within
the 100 billion row data set.
So think about it.
That's 30 seconds to do
analysis on 100 billion rows.
I could sit here and
iterate on my queries,
interact almost in real time
with my data very, very easily.
Now obviously, doing this
interactively with the web UI
is one thing.
Another thing is to
actually expose this data,
pull it out of
BigQuery and expose it
through some other application.
So we have an example here.
This is a customer
success application
that pulls data from BigQuery.
So let's take a look at
one of the dashboards.
If you went to the
App Engine session,
you saw this dashboard earlier.
This has basically got a
bunch of visualizations
of different types of
user engagement metrics
within my product.
And what this allows me to do
is pull the data out of BigQuery
through APIs and
represent it in code.
So behind the scenes here
I've got a user events table.
Now, this table is
a paltry 500,000 rows,
but it's a modest amount of data
for the usage of this product.
And it's a demo
application, so it doesn't
have tens of millions of rows.
What we do then is we
write a little bit of code
to access this.
And here's the code.
We're going to go to BigQuery.
This is a little Python script.
You'll notice what we do
is we import a library.
So Google has code libraries
or client libraries
available for a variety
of different languages,
and it's quite simple.
We create a BigQuery client,
we connect to the data set,
and then within the data
set, we connect to a table.
And within the table, we are
then going to run a query.
If we want to run an
arbitrary query for instance,
we take our client, we run a
query, we designate the query.
We give it a time out,
so this makes sure
that the query has up
to 45 seconds to run.
We run the query,
and then we simply
iterate through the results.
And this allows us to
process through the results
of the query.
If you want to
write to BigQuery,
it's quite straightforward.
You simply take the table
and you insert a row.
And this adds events
into the backend.
So this data gets written
automatically, then the API
pulls it out and
renders it in dashboards
within your application.
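The read and write flow just described might look roughly like this with Google's google-cloud-bigquery Python client. The dataset, table, and row fields here are made-up placeholders, and the library import is deferred inside each helper so the sketch loads without credentials installed.

```python
def run_query(project_id, sql, timeout_seconds=45):
    """Run an arbitrary query and return its rows.

    Requires `pip install google-cloud-bigquery` plus application
    credentials; the import is deferred so this file loads without them.
    The 45-second default mirrors the timeout mentioned in the session.
    """
    from google.cloud import bigquery
    client = bigquery.Client(project=project_id)
    job = client.query(sql)
    return list(job.result(timeout=timeout_seconds))

def insert_event(project_id, dataset, table, event):
    """Stream one event row into a table; returns per-row errors, if any."""
    from google.cloud import bigquery
    client = bigquery.Client(project=project_id)
    table_ref = client.get_table(f"{project_id}.{dataset}.{table}")
    return client.insert_rows_json(table_ref, [event])

def build_event(user_id, action):
    """Pure helper: shape a row the way insert_rows_json expects
    (field names here are hypothetical)."""
    return {"user_id": user_id, "action": action}
```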
Now you may be using a different
tool to do your visualization
and reporting.
You may not use the
web UI and you may not
want to go so far as to
write a custom dashboarding
application.
Let's say you want
to use something
like Tableau or Looker, or in
this case, I'm using a product
called Mode Analytics.
These BI visualization
and analysis tools
allow you to connect to
BigQuery as a backend.
So you'll notice over here.
Let me make this a
little bit bigger.
You'll notice I
have data sources.
If I click on the plus
sign, I can actually
select a backend
data source, and I
want to connect to BigQuery.
OK.
So it's saying I want
to connect to BigQuery.
It wants to know my project ID.
So I am going to pick my project
ID, which is GCP webinar demos.
And then I need to
create a service account.
And so the service
account is created
through IAM, the Identity
and Access Management service.
And you'll notice that I've got
a variety of service accounts
here.
I can go ahead and create
a new service account.
Actually, sorry, service
accounts are here.
Those were the roles.
I'm going to create
a new service account
and I'm going to call this mode.
And I want to give this
service account BigQuery admin
permissions, so that it can
create tables, delete tables,
read data, and so forth.
And I want to get
a new private key.
And because I read the
Mode documentation,
I know that it needs
to be a P12 key.
So I'm going to create
this service account.
And I get the credentials.
OK?
Actually we'll
cancel out of that.
I don't want to put
it in my key chain.
We'll just skip over that part.
We go back to Mode and
we've got to give it
the name of the service account.
So the name of the service
account or the email address
here is this.
So we designate that and then
we have to upload the key file.
And that should be the key file.
And if all went well, this
has now connected my tool
with BigQuery.
So I can create a new query.
Let's pick one that looks a lot
like what I was using before.
Let's not use the 100 billion.
We can use the 100 million.
OK, now I need to enter a query.
So I'm going to go pick one
of my pre-canned queries.
Let's go ahead and use
the standard SQL format
for getting information about
United's on time arrival.
I can paste that in.
It will run the query.
It gives me a tabular
version of the results.
And if I want to, I can
create a chart for instance
that shows on time arrival.
I can change the chart
type if I want to.
This sort of thing.
So this shows how you can do
analysis and visualization
within a third party tool,
so you can fit BigQuery
into your normal data analysis
and analytics workflows.
Pretty cool.
OK.
Let's switch back and talk
about BigQuery's functionality
overall.
We showed you some pieces of it.
Let's highlight
additional capabilities
and use cases so
that you'll have
a better understanding of how to
apply it in your environments.
So again, BigQuery is
this fully managed,
massive scale, low
cost enterprise data
warehouse that Google
provides running
on top of their compute storage
and networking infrastructure.
There is a demo query that
I've seen at some conferences
that when you run it, it takes
3,300 cores, 330 hard drives,
and 330 gigabits of network
bandwidth for 30 seconds
to complete.
Now those are
tremendous numbers.
But because Google has so
much available capacity,
they're able to give you a slice
of 30 seconds with that level
of resources to run
your query and then
claw those resources
back and deploy them
against somebody else's query.
There is no infrastructure
for you to manage.
You'll often hear this referred
to as being serverless.
Well, it's clearly
not serverless
because there are
servers involved,
but you don't have to
deal with any servers.
You focus just on
getting your data
in and on analyzing your data.
You don't have to worry about
standing up servers, tearing
down servers, configuration
updates, any of that stuff.
So when you perform
queries, you're
going to get results
really, really quick.
So we showed scanning terabytes
of data in roughly 35 seconds.
At the petabyte level,
you're talking a few minutes.
So this actually makes turning
around analysis on huge data
sets very, very fast.
Which means better
analysis, more creativity,
more interesting insights
that you can derive.
It uses standard SQL.
I showed you that there
is a legacy SQL syntax,
but in the newest generation
of their query language
it's ANSI SQL 2011
compliant, so you can actually
lift queries from
other environments
and with very little
tweaking if any they will run
in the BigQuery environment.
You can use the web
UI as I demonstrated,
or you can use third
party tools of choice,
like Looker, Tableau, Qlik.
They've got date
partition tables,
so you can actually have
BigQuery automatically
partition time series
data by day or by date.
And this allows you to have
much higher performance queries.
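A minimal sketch of why that helps: queries against a date-partitioned table typically filter on the _PARTITIONTIME pseudo column, so BigQuery scans only one day's partition instead of the whole table. The table and column names below are placeholders.

```python
def partition_query(table, day, columns="*"):
    """Build a query restricted to one day's partition via the
    _PARTITIONTIME pseudo column (table/columns are placeholders)."""
    return (
        f"SELECT {columns} FROM `{table}` "
        f"WHERE _PARTITIONTIME = TIMESTAMP('{day}')"
    )

print(partition_query("my_project.logs.events", "2017-01-15"))
```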
You can also have BigQuery query
against external data sources.
So you can have a bunch of
data inside of BigQuery,
and then you can join that data
against external source data
as well.
How do you get these massive
amounts of data into BigQuery?
Well, you can do
this interactively.
Like you can upload
files through the web UI,
or you can use the BQ
command line utility
and upload a file
from your local PC.
However, obviously that's
going to work better
for a few megabytes
of data than it will
for a few terabytes of data.
So there are other ways
that make more sense
for transferring
really large volumes.
You can stream data
directly into BigQuery
at 100,000
rows per second.
So one of the things
that we did at Highfive,
this video conferencing
startup that I used to work at,
is we actually would generate--
so say you had 10,000
users on calls.
For all 10,000 calls
we were generating
second-by-second
telemetry data on hundreds
of different counters
on every single call.
We were writing that into
a production database,
but then streaming
it into BigQuery.
And BigQuery was the
data store from which
the customer facing
team dashboards
drew all of their data.
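A common pattern behind that kind of telemetry pipeline is batching rows before streaming them in, since each insert request carries many rows. This is a minimal sketch; the 500-rows-per-request figure is an assumption based on commonly cited streaming limits, so check current quotas.

```python
def batch_rows(rows, batch_size=500):
    """Split telemetry rows into streaming-sized batches.

    500 rows per insert request is an assumed limit -- verify
    against the current BigQuery streaming quotas.
    """
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]

# Made-up per-second counters for 1,200 call samples
counters = [{"call_id": i, "jitter_ms": i % 7} for i in range(1200)]
batches = batch_rows(counters)
print(len(batches), len(batches[-1]))  # 3 batches; the last has 200 rows
```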
Google Cloud Dataflow is
a data pipeline framework
that allows you to easily stand
up ETL sort of transformation
jobs that take data
from incoming sources,
manipulate the data,
and write them out
into some target repository.
Dataflow works
beautifully with BigQuery.
And so you can easily
be processing batch data
or streaming data and pumping it
through a transformation engine
into BigQuery.
There's also the newly announced
BigQuery Transfer Service
where you can set up regular
transfers of your YouTube data,
your DoubleClick data,
and your AdWords data
and have that stream
automatically into BigQuery
for performing analytics.
There will be
additional services
coming to BigQuery Transfer
Service in the future.
Right now, through the middle of
2017, BigQuery Transfer Service
is free.
After the midpoint
in the year, it
will become a billable
monthly service.
Availability and access.
Your data is replicated
across multiple geographies.
What that means is that the
durability and availability are
through the roof.
If you have a device go down,
if you have a zone go down,
if you have a region
go down, you're
still going to be able to
get access to your data,
and you will not lose any data.
You also then can control
who can access that data.
So there are identity and
access management roles
that you can assign.
So you can say developers
can write data,
and your analysts can read data.
Or your end users can read data.
You also can set permissions
at the per data set level.
You have really high
availability, as I mentioned.
A 99.9% SLA on availability.
The data that you store in
BigQuery is encrypted at rest.
We mentioned the fact that
you can use IAM roles,
and the fact that you can either
go through command line or web
UI or external clients.
Pricing.
So there's two components
to pricing for BigQuery.
There's a storage component and
a query processing component.
So the storage
component, you pay
$0.02 per gigabyte per month.
If the data goes unchanged
for longer than 90 days,
you actually get a
discount on that.
So if the data
doesn't change but you
keep querying against it, the
price automatically goes down.
You pay $5 per terabyte of
data processed for queries.
The first terabyte is free.
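Putting those rates together, here is a rough monthly-bill sketch. The $0.01/GB long-term rate is my assumption that the discount is half price; verify both rates against the current pricing page.

```python
def monthly_cost_usd(active_gb, long_term_gb, tb_queried,
                     storage_rate=0.02, long_term_rate=0.01,
                     query_rate=5.0, free_query_tb=1.0):
    """Rough monthly BigQuery bill from the rates quoted in this session.

    Assumes $0.02/GB active storage, half price for long-term
    (unmodified) storage, and $5/TB queried after the first free TB.
    """
    storage = active_gb * storage_rate + long_term_gb * long_term_rate
    queries = max(0.0, tb_queried - free_query_tb) * query_rate
    return round(storage + queries, 2)

# e.g. 500 GB active, 2,000 GB untouched, 10 TB scanned in the month
print(monthly_cost_usd(500, 2000, 10))  # 75.0
```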
For companies that do
more than $10,000 a month
in analytics processing
on a monthly basis,
they can actually get
flat rate pricing.
And you buy what
are called slots.
It's sort of technical,
but it's a different model
for processing fees.
And you can get what
essentially is a discount
by using flat rate pricing.
If your volume goes
above $10,000 a month,
certainly up to about
$40,000, $50,000 a month,
talk to your Google sales
rep and they can hook you up.
It should be noted
that BigQuery is not
a replacement for every
enterprise data store.
It's not a transactional
relational database.
So if you're looking to read
and write into individual cells
in rows in a table and
constantly update that,
that's not what this is for.
BigQuery is data that you write
in, you don't change the data,
but you're doing reporting
against the data.
Similarly, it's not an
operational data store,
and it is a
cloud-based solution.
So if you have an
application that
requires on premise
data warehousing,
BigQuery would
not work for that.
But you want to ask
yourself, are the benefits
of BigQuery worth
making the changes
necessary in our workflow to
put this data in the cloud?
A lot of organizations
that didn't think
they would want
to, or be able to, do
their data warehousing
in the cloud are finding
that the benefits are great
enough that they can find
creative ways to make it work.
So given this, you want to
work with your internal teams
to figure out what you can
do with BigQuery.
So what Shine
Technologies did was
they were working
with a customer
that did a huge amount
of internet advertising.
And what this
customer wanted to be
able to present to
their executives
was a live dashboard that showed
the performance of their ads.
So what they needed to have was
the ability to stream millions
and millions of ad
impressions into BigQuery,
generate the analytics on
this constantly updated data
set that had billions
of rows, and do
complex queries to
drive insights, and then
visualize these.
And Shine was able to do this
without any Hadoop clusters,
without any sophisticated
systems operations investment.
And they were able to deliver
the client dashboards,
which would consistently
render visualizations
of up-to-date data
in 20 to 30 seconds.
Saved tons of money.
Actually became a
product that they
were able to start delivering
to other customers as well.
Now when you walk away
from today's session,
you may have questions as
you're trying things out
and you may need guidance on
how to perform certain tasks,
so where do you look
for data or information?
How do you find what
you're looking for?
So as always, you want to start
with search, and just type
in BigQuery and your topic.
So for instance,
you want to know
how to get data into BigQuery.
BigQuery ingestion.
And you'll notice you've got
preparing data for loading,
streaming data in, improving
data ingestion times,
and so forth, OK?
So for instance,
this article talks
about using Avro, which
compresses the data before it
transmits it in and decompresses
it on the other end.
This is built into
BigQuery.
It makes data loading
much more efficient.
There's also the typical
table of contents
in BigQuery documentation.
So let me actually take
you on a quick tour.
So if we type in BigQuery.
Up here at the top will be
the splash page from Google.
If you click on
View BigQuery docs,
you'll get the typical
table of contents.
There's a Quickstart here.
Actually two.
One for the web UI, and
one for the command line.
These get you up and running
in less than five minutes.
They give you a
very quick taste.
And then you've got a drill-down
into the how-to guides.
So you can see how
do I load data?
How do I prepare the
data before I load it?
How do I transform the data?
How do I load it from data
store, or cloud storage,
or via API?
How do I stream it in?
And then here's,
how do I do queries?
And there's a section down
here on how do I export data?
And then general basics around
security, and management,
and so forth.
So pretty well organized,
reasonably exhaustive
reference.
Should be very useful to you.
Now, I don't have a
prescribed methodology
for how to do an
evaluation of BigQuery
because it really depends
on your applications.
I've heard two very different
approaches commonly used
by companies that have large
scale data warehousing needs.
One is to take your
top half dozen queries.
So your top half
dozen jobs that are
breaking on your
existing infrastructure,
or just take a long time,
and move them to BigQuery.
So you just pick five or six
queries, you load your data,
you run the queries, you see
how much better they are.
If they're better, great.
You're ready to go.
On the other end of the spectrum
are companies where they really
are going to be very methodical
and exhaustive about the way
they do the evaluation.
So before they can move any
of their data into the cloud,
they have to do compliance
and regulatory research.
They have to cleanse
all of their data.
They may need to
anonymize the data.
They load massive
amounts of data
and then they
essentially replicate
their existing analytics
workflows in BigQuery.
And then they may go so
far as to actually do
blind taste tests.
So they have their front end
representation be identical,
but they have half of the users
go through the new BigQuery
enabled pipeline, and half
the users go through the old.
And they judge the efficacy,
the level of satisfaction,
and so forth.
So it's going to be
up to you, and it
will depend on what
vertical you're in,
what data you're looking
at, and so forth.
Certainly, a quick
and dirty way to get
a sense of the
potential benefits
is to do the half dozen queries.
And then depending on your
long-term applications,
you may need to do a much more
exhaustive proof of concept.
So that's it for this session.
Thanks so much for joining us.
I hope you found it
interesting and helpful.
We're going to send
you a follow-up email,
and it will have a
link to a code lab,
so you'll get a guided walk
through BigQuery that you
can do on your own.
We'd love to get
your feedback, so
feel free to provide
the feedback in the UI
in front of you, and please
stay tuned for the next session.
