Hi there, I'm Ted Dunning from HPE.
I work in the CTO office on all things to
do with data fabric, and what I want to show
you today is how you can deal with data so
that you can have the storage efficiency of
the most efficient encodings and yet the
performance of the most performant
encodings in the same system.
So what we're going to look at today
is a fairly typical data ingestion pipeline.
And so data comes from here.
That's the raw data that's coming in.
It's going to be coming in at a
rate of about 100 terabytes per day.
Now, that data won't arrive evenly, so
we can't just divide by 24 hours
to get an hourly rate.
We are going to estimate something
like 2x that at peak.
And then the analytical data that we're going
to retain, we're going to retain for three
months and it's going to be
considerably compressed from the original.
And then finally, we're going to keep
analytical summaries and aggregates for a long
period of time. So let's analyze that.
But let's start by talking
about how encodings work.
In these systems, the way that we store data
is not just on one computer, a server that
has multiple disks. No, we will
use multiple computers working together.
A group like this.
Now, it's not just four, of course.
In practical settings, it could be thousands,
depending on how big your data is.
But generally it's at least five, and more
typically the bottom end is around 10 servers.
We group these together into a single
domain of operation that works relatively
autonomously to store all of our data
and we call that a cluster.
It's no big deal, just what we call it.
They all work together.
Now, if we have data that we want to store like
this, what we do is write it onto the
disk of one of those machines.
But that's not quite good enough, because disks
fail. Disks fail at a rate of several percent per
year, according to some studies
from about 10 years ago.
It's not clear how big the failure rate is,
but we have to account for the worst case.
So if that disk fails, or worse, if
that entire server fails, our data would become
inaccessible at best and
lost entirely at worst.
So what we want to do is not
just store the data on one machine.
We want to store it
on multiple machines, too.
For instance, this one.
And that one. That seems like it's plenty
good, because if something fails, we still have
a copy. But in fact, it's not quite good enough,
because if one of those drives fails, then we have
one copy left, and we can replicate that
again to get back to two copies.
But during that time, we
could lose another disk.
Now, the probabilities of that are much less
than losing a particular disk over a period
of a year or so.
But they are still large enough that we're going
to lose some data over a period of 100
years. Or, put another way, we're going to lose
some fraction of a percent of our data over a period
of one year. That, again, is a worst-case
estimate, but it's a bad enough case that I
don't like it. So what's more common in
this replicated storage style is to store
the data on three machines.
So here these three machines each have a copy of
our data so that if we lose one machine
or disk, our data is not going to be lost.
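To see why two copies aren't quite enough but three are, here's a back-of-envelope sketch in Python. The failure rate, rebuild window, and cluster size are all assumed round numbers for illustration, not measurements from the talk:

```python
# Back-of-envelope estimate of annual data-loss risk with 2 vs. 3 copies.
# All numbers here are assumptions for illustration, not measurements.

DISK_AFR = 0.03          # assumed annual failure rate per disk (~several percent)
REBUILD_DAYS = 1.0       # assumed window needed to re-replicate after a failure
N_DISKS = 1000           # assumed disks in the cluster

# Expected disk failures per year across the whole cluster.
failures_per_year = DISK_AFR * N_DISKS

# Chance a surviving copy's disk also dies during the rebuild window.
p_second_loss = DISK_AFR * (REBUILD_DAYS / 365)

# With 2 copies: any coincident second failure during rebuild loses data.
p_loss_2_copies = failures_per_year * p_second_loss

# With 3 copies: we need two more coincident failures, which is far rarer.
p_loss_3_copies = failures_per_year * p_second_loss ** 2

print(f"~{p_loss_2_copies:.2%} chance of some loss per year with 2 copies")
print(f"~{p_loss_3_copies:.6%} with 3 copies")
```

With these assumed numbers, two copies lose data with roughly a quarter-percent chance per year, matching the "fraction of a percent" worst case above, while three copies push that risk down by several orders of magnitude.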
Now, this is very good, and its performance is
excellent, but its space efficiency is not great,
because we have three times as much
disk devoted to storing our
data as we would like.
Ideally, we would like to store our data once,
or at least store only about that many bits in some
arcane, tricky way.
So if we start again with our data, let's
this time split the data
into two halves.
We can then compute a result, a
combination of those two halves, which is the
difference in some sense:
its bits are on wherever the
bits differ in the originals.
Then we can store these pieces, the "ta" there
and the "da" and the combination, into the cluster.
Ta-da! as they say.
We now have stored three half-sized pieces of our
data, so we have a 50 percent overhead.
That's a lot better than we had before with
a 200 percent or 100 percent overhead when we
had two or one extra copy of the data.
And if we have a failure, say this disk
right here has failed, what we can do is
take the data that remains in the cluster.
This last half.
And the difference. And we can combine them
so that we apply those differences back to
the original data and recreate
the half that we lost.
We can then store that data back in the
cluster and we've recovered from our data loss.
So we can recover from one loss there.
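Here's a minimal sketch of that split-plus-difference idea in Python, using XOR as the difference. It's just to illustrate the concept, not the data fabric's actual code:

```python
# Minimal sketch of the split-plus-difference idea using XOR.
# Real systems use stronger codes, but the recovery logic is the same shape.

def split_and_encode(data: bytes):
    """Split data into two halves and compute their XOR as the parity piece."""
    half = len(data) // 2
    a, b = data[:half], data[half:half * 2]
    parity = bytes(x ^ y for x, y in zip(a, b))
    return a, b, parity

def recover(surviving: bytes, parity: bytes) -> bytes:
    """XOR the parity against the surviving half to rebuild the lost half."""
    return bytes(x ^ y for x, y in zip(surviving, parity))

a, b, parity = split_and_encode(b"tada")   # a = b"ta", b = b"da"
assert recover(b, parity) == a             # lost "ta"? rebuild from "da" + parity
assert recover(a, parity) == b             # lost "da"? rebuild the other way
```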
And generally, the way we write these encodings
down, the way we describe them, is by
how many pieces we broke our
original data into,
and how many so-called parity blocks,
or redundancy blocks, we stored.
If we just store the original
data, we have one plus zero.
And of course, our failure
tolerance there is zero.
And we can see this here.
The 1+0 has zero percent
overhead and zero failure tolerance.
It's not as good as we would like, of course.
And then 1+1, that's where we have one replica
of our data in addition to the original,
that has failure tolerance of
one but 100 percent overhead.
For 1+2 we have the same sort of thing,
but that's three copies of the data in the
cluster. Now let's focus in
on two particular options here:
1+2, where we triplicate the data,
with three copies in the cluster, and 4+2.
4+2 is where we break the
original data into four pieces and compute two parity
blocks, not just using a simple difference like
we did before, but a more advanced erasure code.
In this sort of encoding with four quarters and
two parity blocks, we now have only 50
percent overhead because the two parity blocks are
only half the size of the four
quarters combined. And we can take
two failures in the system.
So we have the same failure tolerance as
triplication, but we've cut the overhead by a
factor of four. This takes 1.5 times
as much space as the original data.
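To summarize how these k+m numbers work out, here's a small Python sketch; the overhead formula m/k and the tolerance of m failures follow directly from the definitions above:

```python
# Space overhead and failure tolerance of a k+m encoding:
# overhead is m/k, and any m pieces can be lost without losing data.

def describe(k: int, m: int) -> str:
    overhead = m / k
    return f"{k}+{m}: {overhead:.0%} overhead, tolerates {m} failures"

for k, m in [(1, 0), (1, 1), (1, 2), (4, 2), (6, 2), (8, 2)]:
    print(describe(k, m))

# 1+0: 0% overhead, tolerates 0 failures
# 1+2: 200% overhead, tolerates 2 failures   (triplication)
# 4+2: 50% overhead, tolerates 2 failures
```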
That's good. We could even do better.
We could go to 6+2 or 8+2, and we're
going to narrow that overhead down further and
further. Overall, as a rule of
thumb, for the triplication case (the 1+2
encoding), we just multiply the original data size
by four to get the raw disk size.
That's three copies for triplication, plus some
extra space for the file system to operate
efficiently. In contrast, for the 6+2 case,
we just multiply the size by two.
Again, that covers the roughly 33 percent overhead
of 6+2, or the 25 percent overhead of 8+2, depending
on which encoding we're using, plus some
extra for the file system to work in.
Those multipliers are a little bit generous, but they're
good rules of thumb when sizing a system.
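Those rules of thumb are easy to capture in code. Here's a sketch, where the multipliers are the talk's round numbers including file-system slack:

```python
# Rules of thumb from the talk: raw disk is about 4x the data for 1+2
# (triplication), and about 2x for erasure codes like 4+2, 6+2, or 8+2.
# Both multipliers are a bit generous to leave slack for the file system.

RAW_DISK_MULTIPLIER = {"1+2": 4.0, "4+2": 2.0, "6+2": 2.0, "8+2": 2.0}

def raw_disk_needed(data_tb: float, encoding: str) -> float:
    """Estimate raw disk (in TB) needed to hold data_tb under an encoding."""
    return data_tb * RAW_DISK_MULTIPLIER[encoding]

print(raw_disk_needed(20, "1+2"))   # 80.0 TB for hot, replicated data
print(raw_disk_needed(500, "4+2"))  # 1000.0 TB for erasure-coded data
```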
So those are the two encodings we can use.
Now, in HPE Ezmeral Data Fabric, the way we
do this is with something called a
volume, and we're going to
draw a volume as a triangle here.
There's no real meaning to that, other than it's different
from the way we're going to draw a
directory. A directory is just what you think of
as a directory, but if you're
using the system rather than administering it,
a volume looks just like a directory.
And so we can build a hierarchical system.
There's a root volume up at the top and
it contains, as if it were just a directory,
directories and volumes (or volume mount
points as they are called).
And then those directories contain
directories and files and stuff.
And the volumes themselves also contain
volumes and directories and stuff.
So a volume is essentially a directory
with management superpowers in our data
fabric implementation.
Now, there's a special way
we can set up the volume so that
the volume itself writes
data in a replicated fashion,
typically with 1+2 encoding.
But in addition to this volume there's a shadow
volume, and upon the activation of a
rule, or by explicit forcing, something happens.
The rules are typically something like one
week after something is created, or one day
after it's created.
What happens is that we move the data from
the primary replicated version of the volume into
the shadow erasure-coded version.
The effect of this is a shrinking of the space used.
The read cost is about the same;
in particular, if we're saturating the disks in
a cluster, or part of a cluster, we
can get a net aggregate read
speed that's about the same.
It does cost something to do that encoding:
we have to read the original, compute the
parity blocks, and write those parity blocks out.
So there's a cost to doing this.
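To make the rule idea concrete, here's a hypothetical sketch of an age-based offload scan. The function name, the one-day threshold, and the use of file creation time are all illustrative assumptions, not Data Fabric's actual API; in the real system, the rule is a volume property that the platform enforces for you:

```python
# Hypothetical sketch of an age-based offload rule. The names and the
# one-day threshold are made up for illustration; in the real system this
# is a volume-level policy, not a script you run yourself.

import os
import time

AGE_THRESHOLD_SECS = 24 * 60 * 60   # e.g., "one day after a file is created"

def files_ready_for_erasure_coding(volume_path, now=None):
    """Yield files old enough to move from the replicated tier to the EC tier."""
    now = now if now is not None else time.time()
    for root, _dirs, names in os.walk(volume_path):
        for name in names:
            path = os.path.join(root, name)
            if now - os.path.getctime(path) > AGE_THRESHOLD_SECS:
                yield path   # candidate: rewrite as 4+2, drop extra replicas
```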
Let's look at how it turns out in practice.
Let's look at that system again.
So we've got the raw data coming
in one hundred terabytes per day.
But we're only going to
keep one to two hours.
So if we assume it's got some sort of peak-to-average
ratio of like three to one, then an hour
of raw data at peak would be about 12 terabytes.
We wait for an hour and then we
compress that data over a short period of
time. So we'll have 12 terabytes to
compress, plus some fraction
of the 12 from the next hour.
Let's call it 20 terabytes.
That's for the raw data coming in.
We're going to store that in a
volume that's replicated with 1+2 encoding.
Now, after that, we compress the data.
And we can compress this sort of metrics data
very well, because we're doing it an hour at
a time and sorting on an ID, and that
will cause the compression to be
absolutely stellar, typically 50 to one.
But let's just assume 20 to one.
So what's going to happen is the hundred terabytes
per day is going to get compressed down
to around five terabytes per
day of net new data.
And we'll keep that for about three months.
Call it one hundred days.
So we're gonna keep about 500 terabytes of
data at the end of those three months.
We're going to munch on that data and keep
just some kinds of aggregates that we can use
for certain kinds of historical analytics,
although we can't use them for detailed
examination of what happened.
And that long term archive is going to
be much smaller than the original data.
It's going to be like one
thousandth of the original data.
So about 50 gigabytes per day.
We're going to keep it for years.
Let's pretend three years is a thousand days.
These are round numbers, but that's
how we estimate sizes.
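Pulling those round numbers together, here's the arithmetic as a quick Python sketch; the 3:1 peak ratio and 20:1 compression are the assumptions we just made:

```python
# The sizing arithmetic from the talk, in round numbers.

INGEST_TB_PER_DAY = 100
PEAK_RATIO = 3                # assumed peak-to-average ratio
COMPRESSION = 20              # conservative; sorted metrics often hit 50:1
RETENTION_DAYS = 100          # "three months"
ARCHIVE_TB_PER_DAY = 0.05     # ~50 GB/day of aggregates
ARCHIVE_DAYS = 1000           # "three years"

peak_hour_tb = INGEST_TB_PER_DAY / 24 * PEAK_RATIO              # ~12.5 TB
hot_tb = 20                   # that hour plus a fraction of the next
analytic_tb = INGEST_TB_PER_DAY / COMPRESSION * RETENTION_DAYS  # 500 TB
archive_tb = ARCHIVE_TB_PER_DAY * ARCHIVE_DAYS                  # 50 TB

print(f"hot: {hot_tb} TB, analytic: {analytic_tb:.0f} TB, "
      f"archive: {archive_tb:.0f} TB")
```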
Now, if we draw a summary here,
this table shows how things break down.
The raw data, remember, is one hour plus part
of the next hour, about 20 terabytes of
data as we think about it.
We're going to encode that with 1+2 encoding,
so we multiply by four: three
copies plus some slack, per our rule of thumb.
The total, then, is about 80 terabytes.
The actual analytical data,
on the other hand, is about
500 terabytes, and we're going to encode that
with 4+2.
So let's estimate a 2x expansion;
as I mentioned, that's a bit
pessimistic, but it's good for sizing.
So the size that we
get is about one petabyte.
Much, much larger than the hot
data, the raw data coming in.
That's because even though it's compressed a lot,
we're keeping it a long, long time
relative to the one to two hours.
Finally, then, the archive we're going to
also encode in four plus two.
It's going to be kept for a very long
time, but it's very, very small relative to the
original data. So it'll
be about 100 terabytes.
Now, if you look at this, our size
here is completely dominated by the analytical data,
which is about a petabyte.
On the other hand, the data write volume is
very, very heavily dominated by the raw data.
If we write out this sort of summary, we
see that the space overhead, the total space
required to store all of the data relative to
what pure 4+2 encoding would need, is about
four percent. That is, the cost of keeping the raw
data in a space-inefficient encoding for performance
reasons is only about four percent.
On the other hand, the performance penalty of
doing the erasure coding (because we only
erasure code the compressed
data) is also right around five percent.
So we're getting the best of both.
We're getting the performance of the 1+2 encoding
and we're getting the space of the 4+2
encoding. It's a very odd situation in computing:
usually things go against us both
ways, but here they're going for us both ways.
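Here's that whole summary worked out in one sketch, using the rule-of-thumb multipliers from before; the ~105 TB/day in the write-penalty line is the raw ingest plus the ~5 TB/day of compressed data:

```python
# Raw disk per tier using the earlier rules of thumb, plus the two
# "best of both worlds" percentages from the talk.

tiers = {                      # (data TB, encoding multiplier)
    "raw":      (20,  4.0),    # 1+2, hot ingest
    "analytic": (500, 2.0),    # 4+2
    "archive":  (50,  2.0),    # 4+2
}

disk = {name: tb * mult for name, (tb, mult) in tiers.items()}
total = sum(disk.values())                  # ~1180 TB

# Space overhead vs. erasure coding everything (raw at 2x instead of 4x):
all_ec = sum(tb * 2.0 for tb, _ in tiers.values())
space_penalty = (total - all_ec) / all_ec   # ~3.5%, "about four percent"

# Write penalty: only ~5 TB/day of compressed data is erasure coded,
# out of ~105 TB/day written in total.
write_penalty = 5 / 105                     # ~4.8%, "around five percent"

print(disk, f"total={total:.0f} TB")
print(f"space penalty {space_penalty:.1%}, write penalty {write_penalty:.1%}")
```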
And the read speed overhead of reading the
analytical data or the archive
data is near zero, and that's because
the original data is still there.
There's a single copy of
it plus parity blocks.
But we just read the originals.
We ignore the parity blocks, except when some sort
of hardware fails or we add disks or
we remove them or we change the encoding.
So in summary, we can have our cake and eat
it, too, in encoding, since we can have the
performance of triplication in a write sense and in
a read sense, and we can have the
space efficiency of erasure coding in the
same system on the same data ingestion
pipeline. This is really cool, and it's all about
the lazy write technique that we use on
volumes, and the fact that you can
designate how volumes should be treated.
Volumes can use file-oriented
parameters like age, creation time, and others
to determine when to erasure code data.
That's information that's not available
to low-level storage devices,
so this would be much harder to do there.
But it is available as a hint in an
advanced distributed storage system, a filesystem, a
data system, like the
HPE Ezmeral Data Fabric.
I'm Ted Dunning. I work in the
CTO office and this has been fun.
Let's do a lot more of this.
Thank you very much.
