>> Paige: Hello, everybody
and thank you for joining us today
on the Virtual Vertica BDC 2020.
Today's breakout session is entitled
A Technical Overview of
the Vertica Architecture.
I'm Paige Roberts, Open Source
Relations Manager at Vertica
and I'll be your host for this webinar.
Now joining me is Ryan Role-kuh?
Did I say that right? (laughs)
He's a Vertica Senior Software Engineer.
>> Ryan: So it's Roelke. (laughs)
>> Paige: Roelke, okay,
I got it, all right.
Ryan Roelke.
And before we begin,
I want to be sure and encourage you guys
to submit your questions or your comments
during the virtual session
while Ryan is talking
as you think of them as you go along.
You don't have to wait to the end,
just type in your question or your comment
in the question box below
the slides and click submit.
There'll be a Q and A at
the end of the presentation
and we'll answer as many questions
as we're able to during that time.
Any questions that we don't address,
we'll do our best to
get back to you offline.
Now, alternatively, you can
visit the Vertica forums
to post your question there
after the session as well.
Our engineering team is
planning to join the forums
to keep the conversation going,
so you can have a chat
afterwards with the engineer,
just like any other conference.
Now also, you can maximize your screen
by clicking the double arrow button
in the lower right corner of the slides
and before you ask, yes,
this virtual session is being recorded
and it will be available to
view on demand this week.
We'll send you a notification
as soon as it's ready.
Now, let's get started.
Over to you, Ryan.
>> Ryan: Thanks, Paige.
Good afternoon, everybody.
My name is Ryan and I'm a
Senior Software Engineer
on Vertica's Development Team.
I primarily work on
improving Vertica's
query execution engine,
so usually in the space
of making things faster.
Today, I'm here to talk about something
that's more general than that,
so we're going to go
through a technical overview
of the Vertica architecture.
So the intent of this talk, essentially,
is to just explain some
of the basic aspects
of how Vertica works
and what makes it such
great database software.
To explain what makes a query
execute so fast in Vertica,
we'll provide some background to explain
why other databases don't keep up.
And we'll use that as a starting point
to discuss an academic database
that paved the way for Vertica.
And then we'll explain how Vertica's design
builds upon that academic database
to be the great software that it is today.
I want to start by sharing
somebody's approximation
of an internet minute
at some point in 2019.
All of the data on this
slide is generated by
thousands or even millions of users
and that's a huge amount of activity.
Most of the applications depicted here
are backed by one or more databases.
Most of this activity
will eventually result
in changes to those databases.
For the most part, we can categorize
the way these databases are used
into one of two paradigms.
First up, we have online
transaction processing
or OLTP.
OLTP workloads usually operate
on single entries in a database,
so an update to a retail inventory
or a change in a bank account balance
are both great examples
of OLTP operations.
Updates to these data sets
must be visible immediately
and there could be many
transactions occurring concurrently
from many different users.
OLTP queries are usually
key value queries.
The key uniquely identifies
the single entry in a database
for reading or writing.
Early databases and applications
were probably designed for OLTP workloads.
This example on the slide is
typical of an OLTP workload.
We have a table, accounts,
such as for a bank,
which tracks information for
each of the bank's clients.
An update query, like
the one depicted here,
might be run whenever a user deposits $10
into their bank account.
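As a sketch, that update might
look something like this, where the
account_id and balance column names
are assumptions for illustration:

    UPDATE accounts
       SET balance = balance + 10
     WHERE account_id = 42;
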
Our second category is online
analytical processing or OLAP
which is more about using
your data for decision making.
If you have a hardware device
which periodically records how it's doing,
you could analyze trends of
all your devices over time
to observe what data patterns
are likely to lead to failure
or if you're Google, you
might log user search activity
to identify which links helped
your users find the answer.
Analytical processing
has always been around
but with the advent of the internet,
it happened at scales
that were unimaginable,
even just 20 years ago.
This SQL example is
something you might see
in an OLAP workload.
We have a table, searches,
logging user activity.
We will eventually see
one row in this table
for each query submitted by users.
If we want to find out what time of day
our users are most active,
then we could write a query
like this one on the slide
which counts the number of
unique users running searches
for each hour of the day.
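A sketch of such a query, assuming
columns named user_id and search_time:

    SELECT EXTRACT(HOUR FROM search_time) AS hour_of_day,
           COUNT(DISTINCT user_id) AS unique_users
      FROM searches
     GROUP BY EXTRACT(HOUR FROM search_time)
     ORDER BY unique_users DESC;
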
So now let's rewind to 2005.
We don't have a picture of
an internet minute in 2005,
we don't have the data for that.
We also don't have the data
for a lot of other things.
The term Big Data is not
quite yet on anyone's radar
and The Cloud is also not quite there
or it's just starting to be.
So if you have a database
serving your application,
it's probably optimized
for OLTP workloads.
OLAP workloads just aren't mainstream yet
and database engineers probably
don't have them in mind.
So let's innovate.
It's still 2005 and we want to
try something new with our database.
Let's take a look at what happens
when we do run an
analytic workload in 2005.
Let's use as a motivating example
a table of stock prices over time.
In our table, the symbol column
identifies the stock that was traded,
the price column identifies the new price
and the timestamp column
indicates when the price changed.
We have several other columns
which we should know are there,
but we're not going to use
them in any example queries.
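As a sketch, such a table might be
declared like this, where the ts and
trader columns and all of the types
are assumptions for illustration:

    CREATE TABLE stocks (
        symbol  VARCHAR(10),
        price   NUMERIC(10,2),
        ts      TIMESTAMP,    -- when the price changed
        trader  VARCHAR(50)   -- one of several other columns
    );
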
This table is designed
for analytic queries.
We're probably not going
to make any updates
or look at individual rows
since we're logging historical data
and want to analyze changes
in stock price over time.
Our database system is built
to serve OLTP use cases,
so it's probably going to
store the table on disk
in a single file like this one.
Notice that each row contains
all of the columns of our data
in row major order.
There's probably an index
somewhere in the memory of the system
which will help us do point lookups.
Maybe our system expects that
we will use the stock symbol
and the trade time as lookup keys.
So an index will provide quick lookups
for those columns
to the position of the
whole row in the file.
If we did have an update to a single row,
then this representation would work great.
We would seek to the row
that we're interested in;
finding it would probably be very fast
using the in-memory index.
And then we would update the file in place
with our new value.
On the other hand,
if we ran an analytic
query like we want to,
the data access pattern is very different.
The index is not helpful because
we're looking up a whole range of rows,
not just a single row.
As a result, the only way to find the rows
that we actually need for this query
is to scan the entire file.
We're going to end up scanning
a lot of data that we don't need
and that won't just be the
rows that we don't need;
there are many other columns in this table
with information about
who made each transaction,
and we'll also be scanning
through those columns
for every single row in this table.
That could be a very serious problem
once we consider the scale of this file.
Stocks change a lot,
we probably have thousands or millions
or maybe even billions of rows
that are going to be stored in this file
and we're going to scan
all of these extra columns
for every single row.
If we tried out our stocks use case
behind the desk of a
Fortune 500 company,
then we're probably going
to be pretty disappointed.
Our queries will eventually finish,
but it might take so long
that we don't even care
about the answer anymore
by the time that they do.
Our database is not built
for the task we want to use it for.
Around the same time,
a team of researchers
in the Northeast had
become aware of this problem
and they decided to dedicate
their time and research to it.
These researchers weren't just anybody.
The fruits of their labor,
which we now like to
call the C-Store Paper,
were published by eventual
Turing Award winner,
Mike Stonebraker, along with
several other researchers
from elite universities.
This paper presents the design of
a read-optimized relational DBMS
that contrasts sharply
with most current systems,
which are write-optimized.
That sounds exactly like what we want
for our stocks use case.
Reasoning about what makes
our query executions so slow
brought our researchers
to the Memory Hierarchy,
which essentially is a visualization
of the relative speeds of
different parts of a computer.
At the top of the hierarchy,
we have the fastest data units,
which are, of course, also
the most expensive to produce.
As we move down the hierarchy,
components get slower
but also much cheaper
and thus you can have more of them.
Our OLTP database's data
is stored in a file on the hard disk.
We scanned the entirety of this file,
even though we didn't
need most of the data
and now it turns out, that is just about
the slowest thing that our
query could possibly be doing
by over two orders of magnitude.
It should be clear, based on that,
that the best thing we can do
to optimize our query's execution
is to avoid reading
unnecessary data from the disk
and that's what the C-Store
researchers decided to look at.
The key innovation of the C-Store paper
does exactly that.
Instead of storing data
in a row major order,
in a large file on disk,
they transposed the data
and stored each column in its own file.
Now, if we run the same select query,
we read only the relevant columns.
The unnamed columns don't factor
into the table scan at all
since we don't even open the files.
Zooming out to an
internet-scale data set,
we can appreciate the
savings here a lot more.
But we still have to read a lot of data
that we don't need to answer
this particular query.
Remember, we had two predicates,
one on the symbol column and
one on the timestamp column.
Our query is only
interested in AAPL stock,
but we're still reading rows
for all of the other stocks.
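For reference, the query we're optimizing
might look something like this, with the
exact date range being an assumption:

    SELECT AVG(price)
      FROM stocks
     WHERE symbol = 'AAPL'
       AND ts >= '2005-06-01'
       AND ts <  '2005-06-02';
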
So what can we do to optimize
our disk read even more?
Let's first partition our
data set into different files
based on the timestamp date.
This means that we will keep
separate files for each date.
When we query the stocks table,
the database knows all of
the files we have to open.
If we have a simple predicate
on the timestamp column,
as our sample query does,
then the database can use it to figure out
which files we don't
have to look at at all.
So now all of our disk reads
that we have to do to answer our query
will produce rows that pass
the timestamp predicate.
This eliminates a lot
of wasteful disk reads.
But not all of them.
We do have another predicate
on the symbol column
where symbol equals AAPL.
We'd like to avoid disk reads of rows
that don't satisfy that predicate either.
And we can avoid those disk reads
by clustering all the rows
that match the symbol predicate together.
If all of the AAPL rows are adjacent,
then as soon as we see
something different,
we can stop reading the file.
We won't see any more rows
that can pass the predicate.
Then we can use the positions
of the rows we did find
to identify which pieces
of the other columns
we need to read.
One technique that we can use
to cluster the rows is sorting.
So we'll use the symbol column
as a sort key for all of the columns.
And that way we can
reconstruct a whole row
by seeking to the same
row position in each file.
It turns out, having
sorted all of the rows,
we can do a bit more.
We don't have any more wasted disk reads
but we can still be more efficient
with how we're using the disk.
We've clustered all of the rows
with the same symbol together
so we don't really need
to bother repeating
the symbol so many times in the same file.
Let's just write the value once
and say how many rows we have.
This run length encoding technique
can compress large numbers of rows
into a small amount of space.
In this example, we do
de-duplicate just a few rows
but you can imagine de-duplicating
many thousands of rows instead.
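As a tiny worked example, a symbol column
storing AAPL, AAPL, AAPL, MSFT, MSFT
can instead store just the pairs
(AAPL, 3) and (MSFT, 2).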
This encoding is great for
reducing the amounts of disk
we need to read at query time,
but it also has the additional benefit
of reducing the total
size of our stored data.
Now our query requires
substantially fewer disk reads
than it did when we started.
Let's recap what the C-Store
paper did to achieve that.
First, we transposed our data
to store each column in its own file.
Now, queries only have to read
the columns used in the query.
Second, we partitioned the
data into multiple file sets
so that all rows in a
file have the same value
for the partition column.
Now, a predicate on the partition column
can skip non-matching file sets entirely.
Third, we selected a column of our data
to use as a sort key.
Now rows with the same
value for that column
are clustered together,
which allows our query
to stop reading data
once it finds non-matching rows.
Finally, sorting the data this way
enables high compression ratios,
using run length encoding,
which minimizes the size of
the data stored on the disk.
The C-Store system combined
each of these innovative ideas
to produce an academically
significant result.
And if you used it behind the desk
of a Fortune 500 company in 2005,
you probably would've been pretty pleased.
But it's not 2005 anymore
and the requirements of
a modern database system
are much stricter.
So let's take a look at
how C-Store fares in 2020.
First of all, we have designed
the storage layer of our database
to optimize a single query
in a single application.
Our design optimizes the
heck out of that query
and probably some similar ones
but if we want to do
anything else with our data,
we might be in a bit of trouble.
What if we just decide we want
to ask a different question?
For example, in our stock example,
what if we want to plot all the trades
made by a single trader over
a large window of time?
How do our optimizations
for the previous query
measure up here?
Well, our data's partitioned
on the trade date,
that could still be useful,
depending on our new query.
If we want to look at a trader's activity
over a long period of time,
we would have to open a lot of files.
But if we're still interested
in just a day's worth of data,
then this optimization
is still an optimization.
Within each file, our data is
ordered on the stock symbol.
That's probably not too useful anymore,
the rows for a single trader
aren't going to be clustered together
so we will have to scan all of the rows
in order to figure out which ones match.
You could imagine a worse design
but if it becomes crucial
to optimize this new type of query,
then we might have to go as far as
reconfiguring the whole database.
The next problem is one of scale.
One server is probably not good enough
to serve a database in 2020.
C-Store, as described,
runs on a single server
and stores lots of files.
What if the data overwhelms
this small system?
We could imagine exhausting
the file system's inode limit
with lots of small files due
to our partitioning scheme.
Or we could imagine something simpler,
just filling up the disk
with huge volumes of data.
But there's an even
simpler problem than that.
What if something goes
wrong and C-Store crashes?
Then our data is no longer available to us
until the single server
is brought back up.
A third concern,
one of flexibility,
is that one deployment
does not really suit
all of the use cases
we could imagine.
We haven't really said
anything about being flexible.
A contemporary database system
has to integrate with
many other applications,
which might themselves have
pretty restricted deployment options.
Or the demands imposed by
your workloads might change
and the setup you had before
doesn't suit what you need now.
C-Store doesn't do anything
to address these concerns.
What the C-Store paper did do
was lead very quickly to
the founding of Vertica.
Vertica's architecture and design
are essentially all about
bringing the C-Store designs
into an enterprise software system.
The C-Store paper was
just an academic exercise
so it didn't really need to address
any of the hard problems
that we just talked about.
But Vertica, the first commercial database
built upon the ideas of the C-Store paper
would definitely have to.
This brings us back to the present
to look at how an analytic
query runs in 2020
on the Vertica Analytic Database.
Vertica takes the key idea from the paper,
that we can significantly
improve query performance
by changing the way our data is stored,
and gives its users the tools
to customize their storage layer
in order to heavily
optimize really important
or commonly run queries.
On top of that, Vertica
is a distributed system
which allows it to scale up
to internet-sized data sets,
as well as have better
reliability and uptime.
We'll now take a brief look
at what Vertica does to
address the three inadequacies
of the C-Store system that we mentioned.
To avoid locking into a
single database design,
Vertica provides tools
for the database user
to customize the way their data is stored.
To address the shortcomings
of a single node system,
Vertica coordinates processing
among multiple nodes.
To acknowledge the large variety
of desirable deployments,
Vertica does not require
any specialized hardware
and has many features
which smoothly integrate it
with a Cloud computing environment.
First, we'll look at the
database design problem.
We're a SQL database,
so our users are writing SQL
and describing their data the SQL way,
the Create Table statement.
Create Table is a logical description
of what your data looks like
but it doesn't specify the
way that it has to be stored.
For a single Create Table,
we could imagine a lot of
different storage layouts.
Vertica adds some extensions to SQL
so that users can go even
further than Create Table
and describe the way that they
want the data to be stored.
Using terminology from the C-Store paper,
we provide the Create
Projection statement.
Create Projection specifies
how table data should be laid out,
including column encoding and sort order.
A table can have multiple projections,
each of which could be
ordered on different columns.
When you query a table,
Vertica will answer the
query using the projection
which it determines to be the best match.
Referring back to our stock example,
here's a sample Create Table
and Create Projection statement.
Let's focus on our heavily
optimized example query,
which had predicates on
the stock symbol and date.
We specify that the table data
is to be partitioned by date.
The Create Projection Statement here
is excellent for this query.
We specify using the order by clause
that the data should be ordered
according to our predicates.
We'll use the timestamp
as a secondary sort key.
Each projection stores a
copy of the table data.
If you don't expect to need
a particular column in a projection,
then you can leave it out.
Our average price query
didn't care about who did the trading,
so maybe our projection
design for this query
can leave the trader column out entirely.
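As a sketch, those statements might look
something like the following, where the
names, types, and RLE encoding choice
are assumptions for illustration:

    CREATE TABLE stocks (
        symbol  VARCHAR(10),
        price   NUMERIC(10,2),
        ts      TIMESTAMP,
        trader  VARCHAR(50)
    ) PARTITION BY ts::DATE;

    CREATE PROJECTION stocks_by_symbol (
        symbol ENCODING RLE,  -- run length encode the sort key
        price,
        ts                    -- trader is left out entirely
    ) AS
    SELECT symbol, price, ts
      FROM stocks
     ORDER BY symbol, ts;
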
If the question we want
to ask ever does change,
maybe we already have
a suitable projection,
but if we don't, then we
can create another one.
This example shows another projection
which would be much better at
identifying trends of traders,
rather than identifying
trends for a particular stock.
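A sketch of that second projection,
this time ordered on the trader:

    CREATE PROJECTION stocks_by_trader (
        trader ENCODING RLE,
        ts,
        symbol,
        price
    ) AS
    SELECT trader, ts, symbol, price
      FROM stocks
     ORDER BY trader, ts;
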
Next, let's take a look at
our second problem.
Or, excuse me, first:
how should you decide
what design is best for your queries?
Well, you could spend a lot of time
figuring it out on your own,
or you could use Vertica's
Database Designer tool
which will help you by
automatically analyzing your queries
and spitting out a design which
it thinks is going to work really well.
If you want to learn more about
the Database Designer Tool,
then you should attend the session
Vertica Database Designer-
Today and Tomorrow
which will tell you a lot about
what the Database Designer does
and some recent improvements
that we have made.
Okay, now we'll move to
our next problem. (laughs)
The challenge is that one
server does not fit all.
In 2020, we have several
orders of magnitude more data
than we had in 2005.
And you need a lot more
hardware to crunch it.
It's not tractable to keep
multiple petabytes of data
in a system with a single server.
So Vertica doesn't try.
Vertica is a distributed system
so we'll deploy multiple servers
which work together to maintain
such a high data volume.
In a traditional Vertica deployment,
each node keeps some of the data
in its own locally-attached storage.
Data is replicated so that
there is a redundant copy
somewhere else in the system.
If any one node goes down,
then the data that it served
is still available on a different node.
We'll also have it so that in the system,
there's no special node with extra duties.
All nodes are created equal.
This ensures that there is
no single point of failure.
Rather than replicate all of your data,
Vertica divvies it up
amongst all of the nodes in your system.
We call this segmentation.
The way data is segmented
is another parameter of
storage customization
and it can definitely have an
impact upon query performance.
A common way to segment data
is by using a hash expression,
which essentially randomizes the node
that a row of data belongs to,
but with a guarantee that the same data
will always end up in the same place.
Describing the way data is segmented
is another part of the
Create Projection Statement,
as seen in this example.
Here we segment on the
hash of the symbol column
so all rows with the same symbol
will end up on the same node.
For each row that we load into the system,
we'll apply our segmentation expression.
The result determines which
segment the row belongs to
and then we'll send the row to each node
which holds the copy of that segment.
In this example, our
projection is marked KSAFE 1,
so we will keep one redundant
copy of each segment.
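A sketch of a segmented projection
along those lines (again, the names
are assumptions for illustration):

    CREATE PROJECTION stocks_segmented (
        symbol,
        price,
        ts
    ) AS
    SELECT symbol, price, ts
      FROM stocks
     ORDER BY symbol, ts
    SEGMENTED BY HASH(symbol) ALL NODES
    KSAFE 1;
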
When we load a row, we might find that
its segment has copies on
Node One and Node Three,
so we'll send a copy of the
row to each of those nodes.
If Node One is temporarily
disconnected from the network,
then Node Three can serve
the other copy of the segment
so that the whole system
remains available.
The last challenge we brought
up from the C-Store design
was that one deployment does not fit all.
Vertica's cluster design neatly addressed
many of our concerns here.
Our use of segmentation to
distribute data means that
a Vertica system can scale
to any size of deployment.
And since we lack any special hardware
or nodes with special purposes,
Vertica servers can run anywhere,
on premise or in the Cloud.
But let's suppose you need
to scale out your cluster
to rise to the demands
of a higher workload.
Suppose you want to add another node.
This changes the division
of the segmentation space.
We'll have to re-segment
every row in the database
to find its new home
and then we'll have to
move around any data
that belongs to a different segment.
This is a very expensive operation,
not something you want to
be doing all that often.
Traditional Vertica
doesn't solve that problem
especially well, but Vertica
Eon Mode definitely does.
Vertica's Eon Mode is
a large set of features
which are designed with
a Cloud computing environment in mind.
One feature of this design
is elastic throughput scaling,
which is the idea that you can
smoothly change your cluster size
without having to pay the expenses
of shuffling your entire database.
Vertica Eon Mode had an entire session
dedicated to it this morning.
I won't say any more about it here,
but maybe you already
attended that session
or if you haven't, then I
definitely encourage you
to listen to the recording.
If you'd like to learn more
about the Vertica architecture,
then you'll find on this slide
links to several of the academic
conference publications.
These four papers here
include the Vertica Seven Years Later paper,
which describes some
of the Vertica designs
seven years after the founding,
and also a paper about the
innovations of Eon Mode.
And of course, the Vertica documentation
is an excellent resource for learning more
about what's going on in a Vertica system.
I hope you enjoyed learning
about the Vertica architecture.
I would be very happy to take
all of your questions now.
Thank you for attending this session.
