>> Hey. I'm Tim Sander. I'm here at Microsoft Ready in Seattle. Today, I'm going to talk a little bit to you about partitioning in Azure Cosmos DB. [MUSIC].
>> So today, I'm going to talk to you a little bit about partitioning in Azure Cosmos DB. I've drawn a little diagram here of a Cosmos container, in other words, a collection in Cosmos DB, and it's a collection that has three different physical partitions.
For the first part of our talk together, we're going to talk about how partitioning works and really go deep behind the scenes into all the technical details, and then at the end of the session, we'll take a step back and talk about what you as a developer need to know in order to partition your data effectively in Cosmos DB.
So here again, we have a container in Cosmos DB with three different physical partitions, and we've provisioned 3,000 RUs on the container. Cosmos DB achieves its scaling magic essentially through partitioning: when you scale your provisioned throughput up, we add partitions as needed to accommodate that number. So I'll give you a little insight into how that works.
In this case, we have 3,000 RUs, so each individual physical partition here gets 1,000 RUs. A physical partition is basically an individual piece of hardware in our data center, and when you provision throughput on your container, we automatically take the amount of RUs you provisioned and divide it up among all of the physical partitions that you have.
If you were, let's say, to raise this number from 3,000 to 5,000, we would increase the amount of throughput assigned to each physical partition. This is all handled for you automatically by Cosmos DB.
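To make that concrete, here's a minimal sketch of what provisioning a container like this could look like with the Python SDK (azure-cosmos); the endpoint, key, and names here are placeholders I've made up, not anything from the diagram:

```python
from azure.cosmos import CosmosClient, PartitionKey

# Placeholder endpoint and key.
client = CosmosClient("https://<account>.documents.azure.com:443/",
                      credential="<key>")
database = client.create_database_if_not_exists(id="ConferenceDB")

# Provision 3,000 RUs; Cosmos DB divides that amount across
# however many physical partitions back the container.
container = database.create_container_if_not_exists(
    id="Attendees",
    partition_key=PartitionKey(path="/pk"),
    offer_throughput=3000,
)
```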
Now, in many cases, you might need to scale this up by a very significant amount. Let's say you want to scale it up from 3,000 all the way to 36,000. Each one of these individual physical partitions can have up to 10,000 RUs allocated to it, so in this specific example, three physical partitions would not be enough to accommodate 36,000 RUs.
So what Cosmos DB does behind the scenes, with absolutely no impact to your production workload (it all happens automatically), is split one of the existing physical partitions and create another one that holds a subset of the split partition's data. So in this case, where each partition had 1,000 RUs before we scaled up, we now have four partitions, and 36,000 RUs means 9,000 RUs on each physical partition. This all happens behind the scenes automatically.
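If you're doing this from code rather than the portal, the scale-up itself is a one-liner; a sketch using the same hypothetical container as above:

```python
# Raise the container from 3,000 to 36,000 RUs. Cosmos DB splits
# physical partitions behind the scenes as needed; there's nothing
# else for you to do.
container.replace_throughput(36000)
print(container.get_throughput().offer_throughput)  # 36000
```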
As a developer using the product, you never really need to understand how this works. But it's great that you're watching this video, because this is an engineering area of the product that we're really quite proud of: there's no limit to the number of partitions you can have. You can add on as many physical partitions as you'd like; our biggest customers have hundreds or thousands of these, and it really scales quite well.
The next thing I want to shift to is what you as the developer using Cosmos DB need to know, what you're going to control, and what you really need to understand to use the product effectively. You as the developer are going to specify this value here, PK, which is basically my abbreviation for partition key. Your partition key is going to dictate where your data is stored among these physical partitions.
Now, each of these physical partitions, as you saw before, gets a subset of the resources, and it also has a subset of the data allocated to it. The data on this physical partition does not overlap at all with the data on that physical partition, and the partition key that you set is what decides which data is stored where. Cosmos DB will store one or more partition keys on the same physical partition. Your partition key is basically just a property in your document that you designate to be the deciding factor in how your data is stored.
For example, I'm at a conference now, so let's imagine I have a Cosmos DB container that holds data about attendees at this Microsoft conference. Attendees could have names, they could have ages, and they might have, let's say, social security numbers (SSN), a unique identifier per person. We could use this unique identifier as the partition key for each person, so we'll just rename it PK to keep the diagram nice and clean.
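Here's a hedged sketch of what one of those attendee documents might look like, along with a write and a point read against it; the values are made up, and I'm reusing the hypothetical container from earlier:

```python
# A made-up attendee document; "pk" holds the SSN-style unique
# identifier we chose as the partition key.
attendee = {
    "id": "attendee-001",
    "name": "Ada Lovelace",
    "age": 36,
    "pk": "123-45-6789",
}
container.upsert_item(attendee)

# A point read supplies both id and partition key, so it goes
# straight to the one physical partition that owns this logical
# partition.
item = container.read_item(item="attendee-001",
                           partition_key="123-45-6789")
```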
Based on this PK value in the document, we decide which document is stored on which physical partition. We make the guarantee that all the data with a single partition key is going to be stored on the same physical partition. So when splits like the one you saw earlier happen, we'll split an existing physical partition, but we will not split an existing logical partition; in order to work well, Cosmos DB needs to guarantee that all the data for an individual partition key remains on the same physical partition.
Each of these big physical partitions can scale up to, as I mentioned earlier, 10,000 RUs or 10 gigabytes of data. And because a partition key cannot be split across physical partitions, that also means each individual partition key can scale up to 10,000 RUs or 10 gigabytes of data.
Now, that's a nice transition into the second part of our little conversation together: the best practices for picking this partition key, which you as a developer using Cosmos DB really, really want to be aware of. The first one, which you were probably guessing if you were following along closely, is cardinality.
Each of these partition keys can only grow to 10 GB and to 10,000 RUs. So if your partition key has a high cardinality, meaning there are many possible values for it, there's essentially no limit to how far you can scale out in Azure Cosmos DB. The best practice in almost every case, when you create a container and pick your partition key, is to pick one with a really, really high cardinality, because this is essentially your only scaling bottleneck in Cosmos DB; you can create as many partition keys and as many partitions as you'd like. So that's always key for any workload.
It's also very, very important that this partition key, this guy here, does an excellent job of evenly distributing your reads and writes, and your storage as well. You saw earlier that we allocate throughput across your physical partitions, and each one gets a subset of the data and a subset of your throughput, a subset of RUs.
So if all of your requests went to this one physical partition, they could consume up to 9,000 RUs here, but they couldn't borrow from the physical partition next door. This individual partition key can consume 9,000 RUs on this physical partition, and all of the partition keys on it together can consume up to 9,000 RUs, but they can't borrow from the partition next to it. Because of that, it's very, very important that your partition key does a good job of evenly distributing requests and RU consumption, because that ensures you're able to saturate all of the RUs available to you.
Obviously, the next extension of that is that it's important for the partition key to balance storage as well, because if it balances the amount of data stored per partition key, it's most likely also going to evenly distribute requests. Those two things typically correlate strongly, not always, but in most cases. So it's very, very important that your partition key satisfies those three properties: high cardinality, evenly distributed requests, and evenly distributed storage.
That's important for any workload on Azure Cosmos DB.
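One pattern worth knowing if your natural key turns out too skewed, and this is my own aside rather than something from the diagram, is a synthetic partition key: you spread a hot key across several logical partitions by appending a suffix. A sketch of the idea:

```python
import random

def synthetic_pk(natural_key: str, buckets: int = 10) -> str:
    # Append a random bucket suffix so writes for one hot natural
    # key land on several logical partitions. The trade-off: reads
    # for that key now have to fan out across the buckets.
    return f"{natural_key}-{random.randint(0, buckets - 1)}"

doc = {"id": "attendee-002", "name": "Grace Hopper", "age": 40,
       "pk": synthetic_pk("conference-registration")}
```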
Now, for some workloads that are very read heavy, there's a fourth property of a good partition key that I'd like to add.
Imagine you run a query in Cosmos DB on this collection that has data about the conference attendees, with name, age, and PK. Say I ran a query that filtered on, let's say, age. Right now, age is not the partition key of this container, so if I wanted to run a query that filtered on age, what I would need to do is separately and independently check each partition's index. When you run a query in Cosmos DB that needs to check each partition, we basically run one query per partition, and each checks the index of that partition independently of the others.
We do this in parallel, so it'll be quite fast, but it's not as efficient as if you had partitioned the container in such a way that the query knew which partition all the data it needed was on. If you have a query that you run very, very often, and you can partition on a property in that query's filter (an equality filter), the query can be routed only to the relevant partition, and that's going to make the query's RU charge much lower. It's not relevant for all scenarios, but for a really read-heavy scenario, that's a fourth optimization we like to throw in there that can be very helpful.
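In SDK terms, the difference looks roughly like this (same hypothetical container; the SSN value is made up):

```python
# Fan-out query: "age" is not the partition key, so every
# partition's index is checked, one sub-query per partition.
by_age = container.query_items(
    query="SELECT * FROM c WHERE c.age = @age",
    parameters=[{"name": "@age", "value": 36}],
    enable_cross_partition_query=True,
)

# Routed query: an equality filter on the partition key lets the
# SDK send the query to just the one relevant partition.
by_pk = container.query_items(
    query="SELECT * FROM c WHERE c.pk = @pk",
    parameters=[{"name": "@pk", "value": "123-45-6789"}],
    partition_key="123-45-6789",
)
```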
In addition to that, partition keys are the logical boundary for stored procedures in Cosmos DB, so if you're using stored procedures, that's something to keep in mind as well.
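That boundary shows up directly in the SDK: you execute a stored procedure against a single partition key value. A sketch, with a made-up stored procedure id:

```python
# A stored procedure runs scoped to one logical partition, so you
# must supply the partition key value it operates on.
result = container.scripts.execute_stored_procedure(
    sproc="bulkUpdateAttendees",   # hypothetical sproc id
    partition_key="123-45-6789",
    params=[{"discount": 0.1}],
)
```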
That's it, that's partitioning. I hope this gave you some insight into how it works behind the scenes, and also some key takeaways for what you as the developer, the architect, or the DBA using Cosmos DB need to know about partitioning in order to be successful. Thanks again.
